Deep Learning Interviews is home to hundreds of fully-solved problems,
from a wide range of key topics in AI. It is designed to both rehearse
interview or exam-specific topics and provide machine learning M.Sc./Ph.D.
students, and those awaiting an interview, a well-organized overview of the
field. The problems it poses are tough enough to cut your teeth on and to
dramatically improve your skills, but they're framed within
thought-provoking questions and engaging stories.


That is what makes the volume so specifically valuable to students and job 
seekers: it provides them with the ability to speak confidently and quickly on 
any relevant topic, to answer technical questions clearly and correctly, and to 
fully understand the purpose and meaning of interview questions and 
answers. These are powerful, indispensable advantages to have when walking 
into the interview room. 


The book's contents are a large inventory of numerous topics relevant to DL
job interviews and graduate-level exams. That places this work at the
forefront of the growing trend in science to teach a core set of practical
mathematical and computational skills. It is widely accepted that the
training of every computer scientist must include the fundamental theorems
of ML, and AI appears in the curriculum of nearly every university. This
volume is designed as an excellent reference for graduates of such programs.
FOREWORD.


We will build a machine that will fly. 


— Joseph Michael Montgolfier, French Inventor / Aeronaut (1740-1810) 


DEEP learning interviews are technical, dense, and thanks to the field's
competitiveness, often high-stakes. The prospect of preparing for one can be
daunting, the fear of failure can be paralyzing, and many interviewees
find their ideas slipping away alongside their confidence.

This book was written for you: an aspiring data scientist with a quantitative back- 
ground, facing down the gauntlet of the interview process in an increasingly competit- 
ive field. For most of you, the interview process is the most significant hurdle between 
you and a dream job. Even though you have the ability, the background, and the mo- 
tivation to excel in your target position, you might need some guidance on how to get 
your foot in the door. 

Though this book is highly technical, it is not too dense to work through quickly. It
aims to be comprehensive, including many of the terms and topics involved in modern 
data science and deep learning. That thoroughness makes it unique; no other single 
work offers such breadth of learning targeted so specifically at the demands of the 
interview. 

Most comparable information is available in a variety of formats, locations,
structures, and resources: blog posts, tech articles, and short books scattered across
the internet. Those resources are simply not adequate to the demands of deep learning
interview or exam preparation and were not assembled with this explicit purpose in mind.
It is hoped that this book does not suffer the same shortcomings.


This book is designed for use by job seekers in the fields of machine learning and deep
learning whose abilities and background locate them firmly within STEM
(science, technology, engineering, and mathematics). The book will still be of use to
other readers, such as those still undergoing their initial education in a STEM field.
However, it is tailored most directly to the needs of active job seekers and students
attending M.Sc./Ph.D. programmes in AI. It is, in any case, a book for engineers,
mathematicians, and computer scientists: nowhere does it include the kind of very
basic background material that would allow it to be read by someone with no prior
knowledge of quantitative and mathematical processes.

The book's contents are a large inventory of numerous topics relevant to deep learning
job interviews and graduate-level exams. Ideas that are interesting or pertinent
have been excluded if they are not valuable in that context. That places this work at
the forefront of the growing trend in education and in business to emphasize a core
set of practical mathematical and computational skills. It is now widely understood
that the training of every computer scientist must include a course dealing with the
fundamental theorems of machine learning in a rigorous manner; Deep Learning appears
in the curriculum of nearly every university; and this volume is designed as a
convenient ongoing reference for graduates of such courses and programs.

The book is grounded in both academic expertise and on-the-job experience and 
thus has two goals. First, it compresses all of the necessary information into a coher- 
ent package. And second, it renders that information accessible and makes it easy to 
navigate. As a result, the book helps the reader develop a thorough understanding of 
the principles and concepts underlying practical data science. None of the textbooks I 
read met all of those needs, which are: 


1. Appropriate presentation level. I wanted a friendly introductory text accessible 
to graduate students who have not had extensive applied experience as data 
scientists. 


2. A text that is rigorous and builds a solid understanding of the subject without 
getting bogged down in too many technicalities. 


3. Logical and notational consistency among topics. There are intimate connec- 
tions between calculus, logistic regression, entropy, and deep learning theory, 
which I feel need to be emphasized and elucidated if the reader is to fully under- 
stand the field. Differences in notation and presentation style in existing sources 
make it very difficult for students to appreciate these kinds of connections. 


4. Manageable size. It is very useful to have a text compact enough that all of the 
material in it can be covered in a few weeks or months of intensive review. Most
candidates will have only that much time to prepare for an interview, so a longer 
text is of no use to them. 


The text that follows is an attempt to meet all of the above challenges. It will 
inevitably prove more successful at handling some of them than others, but it 
has at least made a sincere and devoted effort. 


A note about Bibliography 
The book provides a carefully curated bibliography to guide further study, whether 
for interview preparation or simply as a matter of interest or job-relevant research. A 
comprehensive bibliography would be far too long to include here, and would be of 
little immediate use, so the selections have been made with deliberate attention to the 
value of each included text. 

Only the most important books and articles on each topic have been included, and 
only those written in English that I personally consulted. Each is given a brief annota- 
tion to indicate its scope and applicability. Many of the works cited will be found to 
include very full bibliographies of the particular subject treated, and I recommend
turning there if you wish to dive deeper into a specific topic, method, or process. 


We have a web page for this book, where we list errata, examples, and any additional
information. You can access this page at: http://www.interviews.ai.
To comment or ask technical questions about this book, send email to:
entropy@interviews.ai.

I would also like to solicit corrections, criticisms, and suggestions from students 
and other readers. Although I have tried to eliminate errors over the multi-year
process of writing and revising this text, a few undoubtedly remain. In particular, 
some typographical infelicities will no doubt find their way into the final version. I 
hope you will forgive them. 


THE AUTHOR. 
TEL AVIV ISRAEL, DECEMBER, 2020. FIRST PRINTING, DECEMBER 2020. 
CHAPTER 1


HOW-TO USE THIS BOOK


The true logic of this world is in the calculus of probabilities. 


— James C. Maxwell 
1.1 Introduction


First of all, welcome to the world of Deep Learning Interviews.


1.1.1 What makes this book so valuable 


TARGETED advertising. Deciphering dead languages. Detecting malignant
tumours. Predicting natural disasters. Every year we see dozens of new
uses for deep learning emerge from corporate R&D, academia, and plucky
entrepreneurs. Increasingly, deep learning and artificial intelligence are
ingrained in our cultural consciousness. Leading universities are dedicating programs
to teaching them, and they make the headlines every few days.
That means jobs. It means intense demand and intense competition. It means a 
generation of data scientists and machine learning engineers making their way into
the workforce and using deep learning to change how things work. This book is for
them, and for you. It is aimed at current or aspiring experts and students in the field 
possessed of a strong grounding in mathematics, an active imagination, engaged cre- 
ativity, and an appreciation for data. It is hand-tailored to give you the best possible 
preparation for deep learning job interviews by guiding you through hundreds of 
fully solved questions. 

That is what makes the volume so specifically valuable to students and job seekers: 
it provides them with the ability to speak confidently and quickly on any relevant 
topic, to answer technical questions clearly and correctly, and to fully understand the 
purpose and meaning of interview questions and answers. 


Those are powerful, indispensable advantages to have when walking into the in- 
terview room. 

The questions and problems the book poses are tough enough to cut your teeth
on and to dramatically improve your skills, but they're framed within thought-provoking
questions, powerful and engaging stories, and cutting-edge scientific information.
What are bosons and fermions? What is chorionic villus? Where did the Ebola virus
first appear, and how does it spread? Why is binary options trading so dangerous?

Your curiosity will pull you through the book’s problem sets, formulas, and in- 
structions, and as you progress, you'll deepen your understanding of deep learning. 
There are intricate connections between calculus, logistic regression, entropy, and deep 
learning theory; work through the book, and those connections will feel intuitive. 


1.1.2 What will I learn 


Starting Your Career 
Are you actively pursuing a career in deep learning and data science, or hoping to do
so? If so, you're in luck: everything from deep learning to artificial intelligence is in
extremely high demand in the contemporary workforce. Deep learning professionals 
are highly sought after and also find themselves among the highest-paid employee 
groups in companies around the world. 

So your career choice is spot on, and the financial and intellectual benefits of land- 
ing a solid job are tremendous. But those positions have a high barrier to entry: the 
deep learning interview. These interviews have become their own tiny industry, with 
HR employees having to specialize in the relevant topics so as to distinguish well- 
prepared job candidates from those who simply have a loose working knowledge of 
the material. Outside the interview itself, the difference doesn't always feel important.
Deep learning libraries are so good that a machine learning pipeline can often be
assembled with little high-skill input from the researcher themselves. But that level 
of ability won't cut it in the interview. You'll be asked practical questions, technical 
questions, and theoretical questions, and expected to answer them all confidently and 
fluently. 

For unprepared candidates, that's the end of the road. Many give up after repeated 
post-interview rejections. 


Advancing Your Career 
Some of you will be more confident. Those of you with years on the job will be highly 
motivated, exceptionally numerate, and prepared to take an active, hands-on role in 
deep learning projects. You probably already have extensive knowledge in applied 
mathematics, computer science, statistics, and economics. Those are all formidable 
advantages. 

But at the same time, it's unlikely that you will have prepared for the interview 
itself. Deep learning interviews, especially those for the most interesting, autonomous,
and challenging positions, demand that you not only know how to do your job
but that you display that knowledge clearly, eloquently, and without hesitation. Some
questions will be straightforward and familiar, but others might be farther afield or 
draw on areas you haven't encountered since college. 

There is simply no reason to leave that kind of thing to chance. Make sure you're 
prepared. Confirm that you are up-to-date on terms, concepts, and algorithms. Refresh 
your memory of fundamentals, and how they inform contemporary research practices. 
And when the interview comes, walk into the room knowing that you're ready for 
what's coming your way. 


Diving Into Deep Learning 

"Deep Learning Job Interviews" is organized into chapters that each consist of an Intro- 
duction to a topic, Problems illustrating core aspects of the topic, and complete Solu- 
tions. You can expect each question and problem in this volume to be clear, practical, 
and relevant to the subject. Problems fall into two groups, conceptual and application- 
based. Conceptual problems are aimed at testing and improving your knowledge of 
basic underlying concepts, while applications are targeted at practicing or applying 
what you've learned (most of these are relevant to Python and PyTorch). The chapters
are followed by a reference list of relevant formulas and a selective bibliography to
guide further reading.


1.1.3 How to Work Problems 


In real life, like in exams, you will encounter problems of varying difficulty. A good 
skill to practice is recognizing the level of difficulty a problem poses. Job interviews 
will have some easy problems, some standard problems, and some much harder prob- 
lems. 

Each chapter of this book is usually organized into three sections: Introduction, 
Problems, and Solutions. As you are attempting to tackle problems, resist the temptation
to prematurely peek at the solution; it is vital to allow yourself to struggle for
a time with the material. Even professional data scientists do not always know right 
away how to resolve a problem. The art is in gathering your thoughts and figuring 
out a strategy to use what you know to find out what you don't. 


PRB-1 CH.PRB- 1.1.

Problems outlined in grey make up the representative question set. This set of prob- 
lems is intended to cover the most essential ideas in each section. These problems are usually 
highly typical of what you'd see on an interview, although some of them are atypical but 
carry an important moral. If you find yourself unconfident with the idea behind one of these, 
it’s probably a good idea to practice similar problems. This representative question set is our 
suggestion for a minimal selection of problems to work on. You are highly encouraged to 
work on more. 


SOL-1 CH.SOL- 1.1. I am a solution.


If you find yourself at a real stand-off, go ahead and look for a clue in one of the 
recommended theory books. Think about it for a while, and don’t be afraid to read 
back in the notes to look for a key idea that will help you proceed. If you still can’t 
solve the problem, well, we included the Solutions section for a reason! As you're 
reading the solutions, try hard to understand why we took the steps we did, instead 
of memorizing step-by-step how to solve that one particular problem. 

If you struggled with a question quite a lot, it's probably a good idea to return to it 
in a few days. That might have been enough time for you to internalize the necessary 
ideas, and you might find it easily conquerable. If you're still having troubles, read 
over the solution again, with an emphasis on understanding why each step makes 
sense. One of the reasons so many job candidates are required to demonstrate their
ability to resolve data science problems on the board is that hiring managers assume
it reflects their true problem-solving skills.




In this volume, you will learn lots of concepts, and be asked to apply them in 
a variety of situations. Often, this will involve answering one really big problem by 
breaking it up into manageable chunks, solving those chunks, then putting the pieces 
back together. When you see a particularly long question, remain calm and look for a 
way to break it into pieces you can handle. 


1.1.4 Types of Problems 


Two main types of problems are presented in this book. 

CONCEPTUAL: The first category is meant to test and improve your understanding 
of basic underlying concepts. These often involve many mathematical calculations. 
They range in difficulty from very basic reviews of definitions to problems that require 
you to be thoughtful about the concepts covered in the section. 

An example in Information Theory follows. 


PRB-2 CH.PRB- 1.2.
What is the distribution of maximum entropy, that is, the distribution which has the
maximum entropy among all distributions on the bounded interval $[a, b]$, $-\infty < a < b < +\infty$?


SOL-2 CH.SOL- 1.2.
The uniform distribution has the maximum entropy among all distributions on the
bounded interval $[a, b]$, $-\infty < a < b < +\infty$.
The variance of $U(a, b)$ is $\sigma^2 = \frac{1}{12}(b - a)^2$.
Therefore the entropy is:


$$\frac{1}{2} \log 12 + \log \sigma. \tag{1.1}$$
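
As a quick numerical check of Eq. (1.1), the sketch below compares the closed form with log(b - a), the differential entropy of U(a, b) in nats. The endpoints a = 2 and b = 5 are arbitrary values chosen for illustration, and the function names are hypothetical.

```python
import math


def uniform_entropy_direct(a: float, b: float) -> float:
    """Differential entropy of U(a, b) in nats, which equals log(b - a)."""
    return math.log(b - a)


def uniform_entropy_via_sigma(a: float, b: float) -> float:
    """The same entropy written as in Eq. (1.1): 1/2 log 12 + log sigma."""
    sigma = math.sqrt((b - a) ** 2 / 12.0)  # standard deviation of U(a, b)
    return 0.5 * math.log(12.0) + math.log(sigma)


a, b = 2.0, 5.0  # illustrative endpoints, not taken from the book
print(uniform_entropy_direct(a, b), uniform_entropy_via_sigma(a, b))
```

The two values agree because b - a = sqrt(12) * sigma, so taking logarithms of both sides gives the form used in Eq. (1.1).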


APPLICATION: Problems in this category are for practicing skills. It’s not enough to 
understand the philosophical grounding of an idea: you have to be able to apply it in 
appropriate situations. This takes practice, mostly in Python or in one of the available
Deep Learning libraries such as PyTorch.

An example in PyTorch follows. 
PRB-3 CH.PRB- 1.3.
Describe, in your own words, the purpose of the following code in the context of
training a Convolutional Neural Network.


    self.transforms = []
    if rotate:
        self.transforms.append(RandomRotate())
    if flip:
        self.transforms.append(RandomFlip())


SOL-3 CH.SOL- 1.3.

During the training of a Convolutional Neural Network, data augmentation and, to some
extent, dropout are used as core methods to decrease overfitting. Data augmentation is a
regularization scheme that synthetically expands the data set by utilizing label-preserving
transformations to add more invariant examples of the same data samples. It is most commonly
performed in real time on the CPU during the training phase, whilst the actual training
takes place on the GPU. It may consist of, for instance, random rotations, random flips,
zooming, spatial translations, etc.
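
To make the idea concrete, here is a minimal, dependency-free sketch of such an augmentation pipeline. The classes `RandomRotate` and `RandomFlip` below are hypothetical stand-ins written for this illustration (they are not the implementations behind the problem's snippet), and images are represented as nested Python lists rather than tensors.

```python
import random


class RandomFlip:
    """Hypothetical transform: mirror the image left-to-right with probability p."""

    def __init__(self, p=0.5):
        self.p = p

    def __call__(self, image):
        if random.random() < self.p:
            return [row[::-1] for row in image]
        return image


class RandomRotate:
    """Hypothetical transform: rotate clockwise by a random multiple of 90 degrees."""

    def __call__(self, image):
        for _ in range(random.randrange(4)):
            # Reversing the rows and transposing rotates by 90 degrees clockwise.
            image = [list(row) for row in zip(*image[::-1])]
        return image


def build_transforms(rotate=True, flip=True):
    """Mirror the structure of the snippet in the problem statement."""
    transforms = []
    if rotate:
        transforms.append(RandomRotate())
    if flip:
        transforms.append(RandomFlip())
    return transforms


def augment(image, transforms):
    """Apply every transform in order; the label of the image is unchanged."""
    for transform in transforms:
        image = transform(image)
    return image


random.seed(0)
image = [[1, 2], [3, 4]]
augmented = augment(image, build_transforms())
print(augmented)
```

Both transforms only rearrange pixels, which is why they are label-preserving: the augmented sample contains exactly the same values as the original.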


KINDERGARTEN 


CHAPTER 2


LOGISTIC REGRESSION 


You should call it entropy for two reasons. In the first place, your uncertainty 
function has been used in statistical mechanics under that name. In the second 
place, and more importantly, no one knows what entropy really is, so in a debate 
you will always have the advantage. 


— John von Neumann to Claude Shannon 


2.1 Introduction 


MULTIVARIABLE methods are routinely utilized in statistical analyses across a
wide range of domains. Logistic regression is the most frequently used
method for modelling binary response data and binary classification.

When the response variable is binary, it characteristically takes the form of 1/0,
with 1 normally indicating a success and 0 a failure. Multivariable methods usually
assume a relationship between two or more independent, predictor variables, and
one dependent, response variable. The predicted value of a response variable may be
expressed as a sum of products, wherein each product is formed by multiplying the
value of the variable and its coefficient. The coefficients are computed from a
respective data set. Logistic regression is heavily used in supervised machine learning
and has become the workhorse for both binary and multiclass classification problems.
Many of the questions introduced in this chapter are crucial for truly understanding
the inner workings of artificial neural networks.
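
The "sum of products" view described above can be sketched in a few lines of Python. The coefficients and the observation below are made-up values chosen only for illustration, and the helper names are hypothetical.

```python
import math


def linear_predictor(coefficients, values, intercept=0.0):
    """Sum of products: intercept + sum_i coefficient_i * value_i."""
    return intercept + sum(c * v for c, v in zip(coefficients, values))


def sigmoid(z):
    """Map a real-valued linear predictor to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


# Illustrative (made-up) coefficients and a single observation.
coefficients = [0.8, -0.4]
observation = [2.0, 1.0]

z = linear_predictor(coefficients, observation, intercept=-1.0)
p = sigmoid(z)
predicted_class = 1 if p > 0.5 else 0  # 1 = success, 0 = failure
print(z, p, predicted_class)
```

In a fitted model the coefficients would be estimated from data; here they are fixed by hand so that the arithmetic is easy to follow.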


2.2 Problems 
2.2.1 General Concepts 


PRB-4 CH.PRB- 2.1.

True or False: For a fixed number of observations in a data set, introducing more
variables normally generates a model that has a better fit to the data. What may be the drawback
of such a model fitting strategy?


PRB-5 CH.PRB- 2.2.
Define the term “odds of success” both qualitatively and formally. Give a numerical 
example that stresses the relation between probability and odds of an event occurring. 


PRB-6 CH.PRB- 2.3.


1. Define what is meant by the term “interaction”, in the context of a logistic regression 
predictor variable. 


2. What is the simplest form of an interaction? Write its formula.


3. What statistical tests can be used to assess the significance of an interaction term?


PRB-7 CH.PRB- 2.4.

True or False: In machine learning terminology, unsupervised learning refers to the
mapping of input covariates to a target response variable that one attempts to predict
when the labels are known.


PRB-8 CH.PRB- 2.5.
Complete the following sentence: In the case of logistic regression, the response
variable is the log of the odds of being classified in [...].


PRB-9 CH.PRB- 2.6.

Describe how, in a logistic regression model, a transformation to the response variable is
applied to yield a probability distribution. Why is it considered a more informative
representation of the response?


PRB-10 CH.PRB- 2.7.
Complete the following sentence: Minimizing the negative log likelihood also means 
maximizing the [...] of selecting the [...] class. 


2.2.2 Odds, Log-odds 


PRB-11 CH.PRB- 2.8.
Assume the probability of an event occurring is p = 0.1.


1. What are the odds of the event occurring?


2. What are the log-odds of the event occurring?


3. Construct the probability of the event as a ratio that equals 0.1.
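
For readers who want to check their arithmetic, the following sketch shows the conversions between probability, odds, and log-odds. It deliberately uses a different probability (p = 0.25) from the one in the problem, and all function names are hypothetical helpers introduced for this illustration.

```python
import math


def odds_from_probability(p):
    """Odds of an event: p / (1 - p)."""
    return p / (1.0 - p)


def log_odds_from_probability(p):
    """Log-odds (logit) of an event: log(p / (1 - p))."""
    return math.log(odds_from_probability(p))


def probability_from_odds(odds):
    """Invert the odds: p = odds / (1 + odds)."""
    return odds / (1.0 + odds)


def probability_from_log_odds(log_odds):
    """Invert the log-odds: p = 1 / (1 + exp(-log_odds))."""
    return 1.0 / (1.0 + math.exp(-log_odds))


p = 0.25  # illustrative value, not the one used in the problem
odds = odds_from_probability(p)
log_odds = log_odds_from_probability(p)
print(odds, log_odds, probability_from_odds(odds), probability_from_log_odds(log_odds))
```

Running each conversion forward and then backward should return the starting probability, which is a convenient sanity check for hand calculations.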


PRB-12 CH.PRB- 2.9.
True or False: If the odds of success in a binary response is 4, the corresponding
probability of success is 0.8.


PRB-13 CH.PRB- 2.10.
Draw a graph of odds to probabilities, mapping the entire range of probabilities to 
their respective odds. 


PRB-14 CH.PRB- 2.11.

The logistic regression model is a subset of a broader range of machine learning models
known as generalized linear models (GLMs), which also include analysis of variance
(ANOVA), vanilla linear regression, etc. There are three components to a GLM; identify these
three components for binary logistic regression.


PRB-15 CH.PRB- 2.12.
Let us consider the logit transformation, i.e., log-odds. Assume a scenario in which the
logit forms the linear decision boundary:


$$\log\left(\frac{\Pr(Y = 1 \mid X)}{\Pr(Y = 0 \mid X)}\right) = \theta_0 + \theta^T X, \tag{2.1}$$


for a given vector of systematic components $X$ and predictor variables $\theta$. Write the
mathematical expression for the hyperplane that describes the decision boundary.


PRB-16 CH.PRB- 2.13.
True or False: The logit function and the natural logistic (sigmoid) function are inverses 
of each other. 
2.2.3 The Sigmoid


The sigmoid (Fig. 2.1), also known as the logistic function, is widely used in binary
classification and as a neuron activation function in artificial neural networks.


FIGURE 2.1: Examples of two sigmoid functions.


PRB-17 CH.PRB- 2.14.
Compute the derivative of the natural sigmoid function:


$$\sigma(x) = \frac{1}{1 + e^{-x}} \in (0, 1). \tag{2.2}$$

PRB-18 CH.PRB- 2.15.
Remember that in logistic regression, the hypothesis function for some parameter vector
$\beta$ and measurement vector $x$ is defined as:


$$h_\beta(x) = \frac{1}{1 + e^{-\beta^T x}} = P(y = 1 \mid x; \beta), \tag{2.3}$$


where y holds the hypothesis value.

Suppose the coefficients of a logistic regression model with independent variables are as
follows: $\beta_0 = -1.5$, $\beta_1 = 3$, $\beta_2 = -0.5$.
Assume additionally that we have an observation with the following values for the independent
variables: $x_1 = 1$, $x_2 = 5$. As a result, the logit equation becomes:


$$\text{logit} = \beta_0 + \beta_1 x_1 + \beta_2 x_2. \tag{2.4}$$


1. What is the value of the logit for this observation? 
2. What is the value of the odds for this observation? 


3. What is the value of P(y = 1) for this observation? 


2.2.4 Truly Understanding Logistic Regression 


PRB-19 CH.PRB- 2.16.
Proton therapy (PT) [2] is a widely adopted form of treatment for many types of cancer 
including breast and lung cancer (Fig. 2.2). 


FIGURE 2.2: Pulmonary nodules (left) and breast cancer (right). 


A PT device which was not properly calibrated is used to simulate the treatment of
cancer. As a result, the PT beam does not behave normally. A data scientist collects
information relating to this simulation. The covariates presented in Table 2.1 are collected during
the experiment. The columns Yes and No indicate if the tumour was eradicated or not,
respectively.


                Tumour eradication
Cancer Type     Yes     No
Breast          560     260
Lung             69      36


TABLE 2.1: Tumour eradication statistics.


Referring to Table 2.1: 


1. What is the explanatory variable and what is the response variable? 
2. Explain the use of relative risk and odds ratio for measuring association. 


3. Are the two variables positively or negatively associated? 
Find the direction and strength of the association using both relative risk and odds 
ratio. 


4. Compute a 95% confidence interval (CI) for the measure of association. 


5. Interpret the results and explain their significance. 
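
The two measures of association named above can be sketched as follows. The counts are made up for illustration (they are not the values in Table 2.1), the function names are hypothetical, and the interval shown is the common large-sample (Wald) approximation on the log-odds-ratio scale, which may differ from the construction used in the book's own solution.

```python
import math


def relative_risk(a, b, c, d):
    """a, b: successes/failures in group 1; c, d: successes/failures in group 2."""
    return (a / (a + b)) / (c / (c + d))


def odds_ratio(a, b, c, d):
    """Ratio of the odds of success in group 1 to the odds in group 2."""
    return (a * d) / (b * c)


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Wald interval: exp(log(OR) +/- z * sqrt(1/a + 1/b + 1/c + 1/d))."""
    log_or = math.log(odds_ratio(a, b, c, d))
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return math.exp(log_or - z * se), math.exp(log_or + z * se)


# Hypothetical 2x2 table: group 1 = (30 yes, 70 no), group 2 = (15 yes, 85 no).
a, b, c, d = 30, 70, 15, 85
rr = relative_risk(a, b, c, d)
or_ = odds_ratio(a, b, c, d)
low, high = odds_ratio_ci(a, b, c, d)
print(rr, or_, (low, high))
```

A relative risk or odds ratio of 1 indicates no association, so checking whether the interval contains 1 is a quick way to judge significance.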


PRB-20 CH.PRB- 2.17.

Consider a system for radiation therapy planning (Fig. 2.3). Given a patient with a ma- 
lignant tumour, the problem is to select the optimal radiation exposure time for that patient. 
A key element in this problem is estimating the probability that a given tumour will be erad- 
icated given certain covariates. A data scientist collects information relating to this radiation 
therapy system. 




FIGURE 2.3: A multi-detector positron scanner used to locate tumours. 


The following covariates are collected: $X_1$ denotes the time in milliseconds for which a patient is
irradiated, $X_2$ holds the size of the tumour in centimetres, and $Y$ denotes a binary response
variable indicating if the tumour was eradicated. Assume that each response variable
$Y_i$ is a Bernoulli random variable with success parameter $p_i$, which holds:


$$p_i = \frac{e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2}}{1 + e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2}}. \tag{2.5}$$


The data scientist fits a logistic regression model to the dependent measurements and
produces these estimated coefficients:


$$\begin{aligned} \beta_0 &= -6, \\ \beta_1 &= 0.05, \\ \beta_2 &= 1. \end{aligned} \tag{2.6}$$


1. Estimate the probability that, given a patient who undergoes the treatment for 40
milliseconds and who is presented with a tumour sized 3.5 centimetres, the system
eradicates the tumour.


2. For how many milliseconds would the patient in part 1 need to be irradiated to have
exactly a 50% chance of eradicating the tumour?
PRB-21 CH.PRB- 2.18.
Recent research [3] suggests that heating mercury-containing dental amalgams may
cause the release of toxic mercury fumes into the human airways. It is also presumed that
drinking hot coffee stimulates the release of mercury vapour from amalgam fillings (Fig.
2.4).


FIGURE 2.4: A dental amalgam. 


To study factors that affect migraines, and in particular, patients who have at least four 
dental amalgams in their mouth, a data scientist collects data from 200K users with and 
without dental amalgams. The data scientist then fits a logistic regression model with an 
indicator of a second migraine within a time frame of one hour after the onset of the first mi- 
graine, as the binary response variable (e.g., migraine=1, no migraine=0). The data scientist 
believes that the frequency of migraines may be related to the release of toxic mercury fumes. 

There are two independent variables:


1. $X_1 = 1$ if the patient has at least four amalgams; 0 otherwise.


2. $X_2$ = coffee consumption (0 to 100 hot cups per month).


The output from training a logistic regression classifier is as follows:


Analysis of LR Parameter Estimates


Parameter    Estimate    Std.Err    Z-val     Pr>|Z|
Intercept    -6.36347    3.21362    -1.980    0.0477
$X_1$        -1.02411    1.17101    -0.875    0.3818
$X_2$         0.11904    0.05497     2.165    0.0304
1. Using $X_1$ and $X_2$, express the odds of a patient having a migraine for a second time.


2. Calculate the probability of a second migraine for a patient that has at least four
amalgams and drank 100 cups per month.


3. For users that have at least four amalgams, is high coffee intake associated with an
increased probability of a second migraine?


4. Is there statistical evidence that having at least four amalgams is directly associated
with a reduction in the probability of a second migraine?
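
When reading an output table such as the one above, it helps to know how its last two columns follow from the first two. The sketch below recomputes the Z-val column as estimate divided by standard error and the Pr>|Z| column as a two-sided standard normal tail probability. The helper names are hypothetical, and the assumption that the table reports Wald statistics is ours, although the recomputed values agree with the table up to rounding.

```python
import math


def wald_z(estimate, std_err):
    """Wald statistic: the estimate measured in units of its standard error."""
    return estimate / std_err


def two_sided_p_value(z):
    """P(|Z| > |z|) for a standard normal Z, via the complementary error function."""
    return math.erfc(abs(z) / math.sqrt(2.0))


# (estimate, standard error) pairs copied from the table above.
table = {
    "Intercept": (-6.36347, 3.21362),
    "X_1": (-1.02411, 1.17101),
    "X_2": (0.11904, 0.05497),
}

results = {
    name: (wald_z(est, se), two_sided_p_value(wald_z(est, se)))
    for name, (est, se) in table.items()
}
for name, (z, p) in results.items():
    print(f"{name}: z = {z:.3f}, p = {p:.4f}")
```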


PRB-22 CH.PRB- 2.19.

To study factors that affect Alzheimer's disease using logistic regression, a researcher
considers the link between gum (periodontal) disease and Alzheimer's as a plausible risk factor
[1]. The predictor variable is a count of gum bacteria (Fig. 2.5) in the mouth.


FIGURE 2.5: A chain of spherical bacteria. 


The response variable, Y, measures whether the patient shows any remission (e.g. yes=1). 
The output from training a logistic regression classifier is as follows: 


Parameter       DF    Estimate    Std.Err
Intercept       1     -4.8792     1.2197
gum bacteria    1      0.0258     0.0194


1. Estimate the probability of improvement when the count of gum bacteria of a patient 
is 33. 




2. Find out the gum bacteria count at which the estimated probability of improvement is 
0.5. 


3. Find out the estimated odds ratio of improvement for an increase of 1 in the total gum 
bacteria count. 


4. Obtain a 99% confidence interval for the true odds ratio of improvement for an increase of
1 in the total gum bacteria count. Remember that the most common confidence levels
are 90%, 95%, 99%, and 99.9%. Table 2.2 lists the z values for these levels.


Confidence Level    z
90%                 1.645
95%                 1.960
99%                 2.576
99.9%               3.291


TABLE 2.2: Common confidence levels. 
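
The z values in Table 2.2 are two-sided critical values of the standard normal distribution. As a small sketch, they can be reproduced with `statistics.NormalDist` from the Python standard library; the function name `z_value` is a hypothetical helper introduced here.

```python
from statistics import NormalDist


def z_value(confidence_level):
    """Two-sided critical value: the (1 + level) / 2 quantile of the standard normal."""
    return NormalDist().inv_cdf((1.0 + confidence_level) / 2.0)


levels = [0.90, 0.95, 0.99, 0.999]
z_values = {level: z_value(level) for level in levels}
for level, z in z_values.items():
    print(f"{level:.1%}: z = {z:.3f}")
```

The quantile is (1 + level) / 2 rather than the level itself because the excluded probability is split evenly between the two tails.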


PRB-23 CH.PRB- 2.20.
Recent research [4] suggests that cannabis (Fig. 2.6) and cannabinoids administration 
in particular, may reduce the size of malignant tumours in rats. 


FIGURE 2.6: Cannabis. 
To study factors affecting tumour shrinkage, a deep learning researcher collects data from
two groups: one group is administered a placebo (a substance that is not medicine) and
the other cannabinoids. His main research revolves around studying the relationship
(Table 2.3) between the anticancer properties of cannabinoids and tumour shrinkage:


Tumour Shrinkage In Rats


Group           Yes     No       Sum
Cannabinoids     60     6833     6893
Placebo         130     6778     6908
Sum             190    13611    13801


TABLE 2.3: Tumour shrinkage in rats. 


For the true odds ratio: 
1. Find the sample odds ratio. 
2. Find the sample log-odds ratio. 


3. Compute a 95% confidence interval ($z_{0.95} = 1.645$; $z_{0.975} = 1.96$) for the true log odds
ratio and true odds ratio.


2.2.5 The Logit Function and Entropy 


PRB-24 @ CH.PRB- 2.21. 


The entropy (see Chapter 4) of a single binary outcome with probability p to receive 1 is 
defined as: 


$$H(p) = -p \log p - (1 - p) \log(1 - p). \tag{2.7}$$

1. At what p does H(p) attain its maximum value? 


2. What is the relationship between the entropy H(p) and the logit function, given p? 

2.2.6 Python/PyTorch/CPP


PRB-25 € CH.PRB- 2.22. 
The following C++ code (Fig. 2.7) is part of a (very basic) logistic regression implement- 
ation module. For a theoretical discussion underlying this question, refer to problem 2.17. 


 1  #include <cmath>     // exp
    #include <iostream>  // std::cout
    #include <numeric>   // std::inner_product
    #include <vector>    // std::vector
 2  std::vector<double> theta {-6, 0.05, 1.0};
 3  double sigmoid(double x) {
 4      double tmp = 1.0 / (1.0 + exp(-x));
 5      std::cout << "prob=" << tmp << std::endl;
 6      return tmp;
 7  }
 8  double hypothesis(std::vector<double> x) {
 9      double z;
10      z = std::inner_product(std::begin(x), std::end(x),
            std::begin(theta), 0.0);
11      std::cout << "inner_product=" << z << std::endl;
12      return sigmoid(z);
13  }
14  int classify(std::vector<double> x) {
15      int hypo = hypothesis(x) > 0.5f;
16      std::cout << "hypo=" << hypo << std::endl;
17      return hypo;
18  }
19  int main() {
20      std::vector<double> x1 {1, 40, 3.5};
21      classify(x1);
22  }


FIGURE 2.7: Logistic regression in CPP 


1. Explain the purpose of line 10, i.e., inner_product.


2. Explain the purpose of line 15, i.e., hypothesis(x) > 0.5f.


3. What does θ (theta) stand for in line 2?


4. Compile and run the code; you can use
https://repl.it/languages/cpp11 to evaluate it.
What is the output?


PRB-26 CH.PRB- 2.23.
The following Python code (Fig. 2.8) runs a very simple linear model on a two-dimensional 
matrix. 


import torch
import torch.nn as nn


lin = nn.Linear(5, 7)
data = (torch.randn(3, 5))


print(lin(data).shape)


FIGURE 2.8: A linear model in PyTorch 


Without actually running the code, determine what is the size of the matrix printed as a 
result of applying the linear model on the matrix. 


PRB-27 @ CH.PRB- 2.24. 
The following Python code snippet (Fig. 2.9) is part of a logistic regression implementa- 
tion module in Python. 

from scipy.special import expit
import numpy as np
import math


def Func001(x):
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum()


def Func002(x):
    return 1 / (1 + math.exp(-x))


def Func003(x):
    return x * (1 - x)


FIGURE 2.9: Logistic regression methods in Python. 


Analyse the methods Func001, Func002 and Func003 presented in Fig. 2.9, find their
purposes and name them.


PRB-28 € CH.PRB- 2.25. 
The following Python code snippet (Fig. 2.10) is part of a machine learning module in 
Python. 

from scipy.special import expit
import numpy as np
import math


def Func006(y_hat, y):
    if y == 1:
        return -np.log(y_hat)
    else:
        return -np.log(1 - y_hat)


FIGURE 2.10: Logistic regression methods in Python. 


Analyse the method (Func006) presented in Fig. 2.10. What important concept in machine- 
learning does it implement? 


PRB-29 CH.PRB- 2.26.
The following Python code snippet (Fig. 2.11) presents several different variations of the 
same function. 

from scipy.special import expit
import numpy as np
import math


def Ver001(x):
    return 1 / (1 + math.exp(-x))


def Ver002(x):
    return 1 / (1 + (np.exp(-x)))


WHO_AM_I = 709


def Ver003(x):
    # Clip x from below at -WHO_AM_I so that np.exp(-x) cannot overflow.
    return 1 / (1 + np.exp(-(np.clip(x, -WHO_AM_I, None))))


FIGURE 2.11: Logistic regression methods in Python. 


1. Which mathematical function do these methods implement? 
2. What is significant about the number 709 assigned to WHO_AM_I?


3. Given a choice, which method would you use? 


2.3 Solutions 
2.3.1 General Concepts 


SOL-4 @ CH.SOL- 2.1. 

True. However, when an excessive and unnecessary number of variables is used in a
logistic regression model, peculiarities (e.g., specific attributes) of the underlying data set
disproportionately affect the coefficients in the model, a phenomenon commonly referred to as
“overfitting”. Therefore, it is important that a logistic regression model does not start training
with more variables than is justified for the given number of observations.

SOL-5 CH.SOL- 2.2.
The odds of success are defined as the ratio between the probability of success p ∈ [0, 1]
and the probability of failure 1 − p. Formally:


$$\text{Odds}(p) = \frac{p}{1 - p}. \tag{2.8}$$


For instance, if the probability of success of an event is p = 0.7, then the odds of
success are 7/3, or 2.333 to 1. Naturally, in the case of equal probabilities,
where p = 0.5, the odds of success are 1 to 1.


SOL-6 Y CH.SOL- 2.3. 


1. An interaction is the product of two single predictor variables, implying a non-additive
effect.


2. The simplest interaction model includes a predictor variable formed by multiplying two
ordinary predictors. Let us assume two variables X and Z. Then, the logistic regression
model that employs the simplest form of interaction follows:


$$\beta_0 + \beta_1 X + \beta_2 Z + \beta_3 XZ, \tag{2.9}$$


where the coefficient for the interaction term XZ is represented by $\beta_3$.


3. For testing the contribution of an interaction, two principal methods are commonly
employed: the Wald chi-squared test or a likelihood ratio test between the model with
and without the interaction term. Note: How does interaction relate to information
theory? What added value does it employ to enhance model performance?


SOL-7 UY CH.SOL- 2.4. 
False. This is exactly the definition of supervised learning; when labels are known then
supervision guides the learning process.


SOL-8 CH.SOL- 2.5.
In the case of logistic regression, the response variable is the log of the odds of being
classified in a group of binary or multi-class responses. This definition essentially demonstrates
that odds can take the form of a vector.


SOL-9 Y CH.SOL- 2.6. 

When a transformation to the response variable is applied, it yields a probability distribu- 
tion over the output classes, which is bounded between 0 and 1; this transformation can be 
employed in several ways, e.g., a softmax layer, the sigmoid function or classic normalization. 
This representation facilitates a soft-decision by the logistic regression model, which permits 
construction of probability-based processes over the predictions of the model. Note: What are
the pros and cons of each of the three aforementioned transformations?


SOL-10 @ CH.SOL- 2.7. 
Minimizing the negative log likelihood also means maximizing the likelihood of selecting
the correct class.


2.3.2 Odds, Log-odds 


SOL-11 UY CH.SOL- 2.8. 


1. The odds of the event occurring are, by definition:


$$\text{odds} = \frac{0.1}{0.9} = 0.11. \tag{2.10}$$


2. The log-odds of the event occurring are simply taken as the log of the odds:


$$\text{log-odds} = \ln(0.1/0.9) = -2.19722. \tag{2.11}$$


3. The probability may be constructed by the following representation:


$$\text{probability} = \frac{\text{odds}}{\text{odds} + 1} = \frac{0.11}{1.11} \approx 0.1, \tag{2.12}$$


or, alternatively:


$$\text{probability} = \frac{\exp(\ln \text{odds})}{\exp(\ln \text{odds}) + 1} = \frac{0.11}{1.11} \approx 0.1. \tag{2.13}$$


Note: What is the intuition behind this representation?
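
The three conversions in this solution can be checked numerically. The following is a minimal Python sketch, assuming an event probability of p = 0.1 as in the solution; the helper names `odds`, `log_odds` and `probability_from_odds` are hypothetical and chosen only for illustration.

```python
import math


def odds(p):
    """Odds of an event with probability p (requires 0 <= p < 1)."""
    return p / (1.0 - p)


def log_odds(p):
    """Natural logarithm of the odds."""
    return math.log(odds(p))


def probability_from_odds(o):
    """Invert the odds back into a probability."""
    return o / (o + 1.0)


p = 0.1
o = odds(p)                                   # 0.1 / 0.9
lo = log_odds(p)                              # ln(0.1 / 0.9)
p_back = probability_from_odds(o)             # odds / (odds + 1)
p_back_from_log = probability_from_odds(math.exp(lo))

print(o, lo, p_back, p_back_from_log)
```

Both routes back to the probability agree because exponentiating the log-odds simply recovers the odds.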


SOL-12 UY CH.SOL- 2.9. 
True. By definition of odds, it is easy to notice that p = 0.8 satisfies the following relation:


$$\text{odds} = \frac{0.8}{0.2} = 4. \tag{2.14}$$


SOL-13 Y CH.SOL- 2.10. 
The graph of odds to probabilities is depicted in Figure 2.12. 


[Plot of odds(p) = p / (1 − p); x-axis: Probability (0.1 to 0.9), y-axis: Odds (0 to 10).]


FIGURE 2.12: Odds vs. probability values.

SOL-14 CH.SOL- 2.11.
A binary logistic regression GLM consists of three components:


1. Random component: refers to the probability distribution of the response variable (Y),
e.g., binomial distribution for Y in the binary logistic regression, which takes on the
values Y = 0 or Y = 1.


2. Systematic component: describes the explanatory variables
(X1, X2, ...) as a combination of linear predictors. The binary case does not constrain
these variables to any degree.


3. Link function: specifies the link between random and systematic components. It says 
how the expected value of the response relates to the linear predictor of explanatory 
variables. 


Note: Assume that Y denotes whether a human voice activity was detected (Y = 1)
or not (Y = 0) in a given time frame. Propose two systematic components and a link
function adjusted for this task.


SOL-15 Y CH.SOL- 2.12. 
The hyperplane is simply defined by: 


$$\theta_0 + \theta^{T} X = 0. \tag{2.15}$$


Note: Recall the use of the logit function and derive this decision boundary rigorously.


SOL-16 UY CH.SOL- 2.13. 
True. The logit function is defined as: 


$$z(p) = \operatorname{logit}(p) = \log\left(\frac{p}{1 - p}\right), \tag{2.16}$$

for any p ∈ (0, 1). A simple set of algebraic equations yields the inverse relation:


$$p(z) = \frac{\exp z}{1 + \exp z}, \tag{2.17}$$

which exactly describes the relation between the output and input of the logistic function, also
known as the sigmoid.
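
The inverse relation between (2.16) and (2.17) can be verified numerically. The sketch below is only an illustration using the standard library; the function names and the test probabilities are assumptions, not part of the original solution.

```python
import math


def logit(p):
    """Log-odds of p, defined for 0 < p < 1 (Eq. 2.16)."""
    return math.log(p / (1.0 - p))


def sigmoid(z):
    """Logistic function written as in Eq. 2.17."""
    return math.exp(z) / (1.0 + math.exp(z))


# Applying the sigmoid to the logit should return the original probability.
probabilities = (0.01, 0.25, 0.5, 0.75, 0.99)
round_trip_errors = [abs(sigmoid(logit(p)) - p) for p in probabilities]

print(max(round_trip_errors))
```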


2.3.3 The Sigmoid 


SOL-17 Y CH.SOL- 2.14. 
There are various approaches to solve this problem; here we provide two: direct derivation
or derivation via the softmax function.


1. Direct derivation:

$$\frac{d}{dx}\sigma(x) = \frac{d}{dx}\left((1 + e^{-x})^{-1}\right) = -(1 + e^{-x})^{-2}\,\frac{d}{dx}(1 + e^{-x}) = \frac{e^{-x}}{(1 + e^{-x})^{2}}.$$


2. Softmax derivation:
In a classification problem with mutually exclusive classes, where all of the values are
positive and sum to one, a softmax activation function may be used. By definition, the
softmax activation function consists of n terms, such that ∀i ∈ [1, n]:


$$f(\theta_i) = \frac{e^{\theta_i}}{\sum_{k} e^{\theta_k}} = \frac{1}{1 + e^{-\theta_i} \sum_{k \neq i} e^{\theta_k}}. \tag{2.18}$$


To compute the partial derivative of 2.18, we treat all $\theta_k$ where k ≠ i as constants and
then differentiate with respect to $\theta_i$ using regular differentiation rules. For a given $\theta_i$, let us define:


$$\beta = \sum_{k \neq i} e^{\theta_k}, \tag{2.19}$$

and

$$f(\theta_i) = \frac{e^{\theta_i}}{e^{\theta_i} + \beta} = \frac{1}{1 + \beta e^{-\theta_i}}. \tag{2.20}$$


It can now be shown that the derivative with respect to $\theta_i$ holds:


$$f'(\theta_i) = (1 + \beta e^{-\theta_i})^{-2}\, \beta e^{-\theta_i}, \tag{2.21}$$

which can take on the informative form of:

$$f'(\theta_i) = f(\theta_i)\,(1 - f(\theta_i)). \tag{2.22}$$


It should be noted that 2.21 holds for any constant β, and for β = 1 it clearly reduces
to the derivative of the sigmoid activation function.


Note: Characterize the sigmoid function when its argument approaches 0, ∞ and −∞.
What undesired properties of the sigmoid function do these values entail when considered as an
activation function?
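
As a sanity check of the result f'(x) = f(x)(1 − f(x)) for the sigmoid, the closed form can be compared against a central finite difference. This is a minimal sketch under assumed test points and step size; `central_difference` is a hypothetical helper, not part of the original solution.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    """Closed-form derivative: sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)


def central_difference(f, x, h=1e-5):
    """Numerical derivative of f at x with step h."""
    return (f(x + h) - f(x - h)) / (2.0 * h)


test_points = (-4.0, -1.0, 0.0, 0.5, 3.0)
max_gap = max(
    abs(sigmoid_grad(x) - central_difference(sigmoid, x)) for x in test_points
)

print(max_gap)
```

The gap stays at the level of numerical noise, which is consistent with the two derivations above.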


SOL-18 Uy CH.SOL- 2.15. 


1. The logit value is simply obtained by substituting the values of the dependent variables
and model coefficients into the linear logistic regression model, as follows:


$$\text{logit} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 = -1. \tag{2.23}$$


2. According to the natural relation between the logit and the odds, the following holds:


$$\text{odds} = e^{\text{logit}} = e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2} = e^{-1} = 0.3678794. \tag{2.24}$$

3. The odds are, by definition:


$$\text{odds} = \frac{P(y = 1)}{P(y = 0)}, \tag{2.25}$$


so the logistic response function is:


$$P(y = 1) = \frac{1}{1 + e^{-\text{logit}}} = \frac{1}{1 + e^{1}} = 0.2689414. \tag{2.26}$$


2.3.4 Truly Understanding Logistic Regression 

SOL-19 CH.SOL- 2.16.


1. Tumour eradication (Y) is the response variable and cancer type (X) is the explanatory 
variable. 


2. Relative risk (RR) is the ratio of risk of an event in one group (e.g., exposed group) 
versus the risk of the event in the other group (e.g., non-exposed group). The odds ratio 
(OR) is the ratio of odds of an event in one group versus the odds of the event in the 
other group. 


3. If we calculate the odds ratio as a measure of association:


$$\hat{\theta} = \frac{560 \times 36}{69 \times 260} = 1.1237. \tag{2.27}$$


And the log-odds ratio is log(1.1237) = 0.1167.


The odds ratio is larger than one, indicating that the odds of eradication for breast cancer
are higher than the odds of eradication for lung cancer. Notice, however, that this result is too
close to one, which prevents a conclusive decision regarding the odds relation.


Additionally, if we calculate relative risk as a measure of association:


$$RR = \frac{560 / (560 + 260)}{69 / (69 + 36)} = 1.0392. \tag{2.28}$$


4. The 95% confidence interval for the odds ratio θ is computed from the sample
confidence interval for the log odds ratio:


$$\hat{\sigma}\left(\log \hat{\theta}\right) = \sqrt{\frac{1}{560} + \frac{1}{260} + \frac{1}{69} + \frac{1}{36}} = 0.21886. \tag{2.29}$$


Therefore, the 95% CI for log(θ) is:


$$0.1167 \pm 1.96 \times 0.21886 = (-0.3123, 0.5457). \tag{2.30}$$

Therefore, the 95% CI for θ is:


$$\left(e^{-0.3123}, e^{0.5457}\right) = (0.732, 1.726). \tag{2.31}$$


5. The CI (0.732, 1.726) contains 1, which indicates that the true odds ratio is not
significantly different from 1 and there is not enough evidence that tumour eradication is
dependent on cancer type.


SOL-20 Ud CH.SOL- 2.17. 


1. By using the defined values for $X_1$ and $X_2$, and the known logistic regression model,
substitution yields:


$$\hat{p}(X) = \frac{e^{-6 + 0.05 X_1 + X_2}}{1 + e^{-6 + 0.05 X_1 + X_2}} = 0.3775. \tag{2.32}$$


2. The equation for the predicted probability tells us that:


$$\frac{e^{-6 + 0.05 X_1 + 3.5}}{1 + e^{-6 + 0.05 X_1 + 3.5}} = 0.5, \tag{2.33}$$

which is equivalent to constraining:

$$e^{-6 + 0.05 X_1 + 3.5} = 1. \tag{2.34}$$


By taking the logarithm of both sides, we get that the number of milliseconds needed is:


$$X_1 = \frac{2.5}{0.05} = 50. \tag{2.35}$$
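
Both steps of this solution can be reproduced with a few lines of Python. The sketch below assumes the coefficients (−6, 0.05, 1.0) and the input X1 = 40, X2 = 3.5 that appear in the C++ listing of Fig. 2.7; the names `THETA`, `predict` and `x1_for_probability_half` are hypothetical.

```python
import math

# Intercept, coefficient of X1, coefficient of X2 (values taken from Fig. 2.7).
THETA = (-6.0, 0.05, 1.0)


def predict(x1, x2):
    """Predicted probability of the logistic regression model."""
    z = THETA[0] + THETA[1] * x1 + THETA[2] * x2
    return 1.0 / (1.0 + math.exp(-z))


def x1_for_probability_half(x2):
    """Value of X1 at which the prediction equals 0.5, i.e. where z = 0."""
    return -(THETA[0] + THETA[2] * x2) / THETA[1]


p = predict(40, 3.5)
x1_star = x1_for_probability_half(3.5)

print(round(p, 4), round(x1_star, 4))
```

Setting the predicted probability to 0.5 is the same as setting the linear term z to zero, which is why the second helper needs no exponentials.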


SOL-21 CH.SOL- 2.18.

For the purpose of this exercise, it is instructive to pre-define z as:


$$z(X_1, X_2) = -6.36 - 1.02 \times X_1 + 0.12 \times X_2. \tag{2.36}$$


1. By employing the classic logistic regression model:


$$\text{odds} = \exp(z(X_1, X_2)). \tag{2.37}$$


2. By substituting the given values of $X_1$, $X_2$ into $z(X_1, X_2)$, the probability holds:


$$p = \exp(z(1, 100)) / (1 + \exp(z(1, 100))) = 0.99. \tag{2.38}$$


3. Yes. The coefficient for coffee consumption is positive (0.119) and the p-value is less
than 0.05 (0.0304).
Note: Can you describe the relation between these numerical relations and the positive
conclusion?


4. No. The p-value for this predictor is 0.3818 > 0.05.
Note: Can you explain why this inequality implies a lack of statistical evidence?


SOL-22 Uy CH.SOL- 2.19. 


1. The estimated probability of improvement is:


$$\hat{\pi}(\text{gum bacteria}) = \frac{\exp(-4.8792 + 0.0258 \times \text{gum bacteria})}{1 + \exp(-4.8792 + 0.0258 \times \text{gum bacteria})}.$$


Hence, $\hat{\pi}(33) = 0.0175$.

2. For $\hat{\pi}(\text{gum bacteria}) = 0.5$ we know that:


$$\hat{\pi} = \frac{\exp(\hat{\alpha} + \hat{\beta} x)}{1 + \exp(\hat{\alpha} + \hat{\beta} x)} = 0.5, \tag{2.39}$$

$$\text{gum bacteria} = -\hat{\alpha} / \hat{\beta} = 4.8792 / 0.0258 = 189.116. \tag{2.40}$$

3. The estimated odds ratio is given by:

$$\exp(\hat{\beta}) = \exp(0.0258) = 1.0261. \tag{2.41}$$

4. A 99% confidence interval for β is calculated as follows:

$$\hat{\beta} \pm z_{0.005} \times ASE(\hat{\beta}) = \tag{2.42}$$
$$0.0258 \pm 2.576 \times 0.0194 \tag{2.43}$$
$$= (-0.0242, 0.0758). \tag{2.44}$$


Therefore, a 99% confidence interval for the true odds ratio exp(β) is given by:


$$(\exp(-0.0242), \exp(0.0758)) = (0.9761, 1.0787). \tag{2.45}$$


SOL-23 Uy CH.SOL- 2.20. 


1. The sample odds ratio is:


$$\hat{\theta} = \frac{130 \times 6833}{60 \times 6778} = 2.1842. \tag{2.46}$$

2. The sample log-odds ratio is log(2.1842) = 0.7812, and the estimated standard error for $\log(\hat{\theta})$ is:


$$\hat{\sigma}\left(\log \hat{\theta}\right) = \sqrt{\frac{1}{60} + \frac{1}{6833} + \frac{1}{130} + \frac{1}{6778}} = 0.1570. \tag{2.47}$$


3. According to previous sections, the 95% CI for the true log odds ratio is:

$$0.7812 \pm 1.96 \times 0.1570 = (0.4734, 1.0889). \tag{2.48}$$


Correspondingly, the 95% CI for the true odds ratio is:


$$\left(e^{0.4734}, e^{1.0889}\right) = (1.6054, 2.9710). \tag{2.49}$$
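
The whole calculation (odds ratio, log-odds ratio, standard error and both confidence intervals) follows one recipe for any 2×2 table. Below is a minimal sketch applied to the counts of Table 2.3; the function name `odds_ratio_ci` and its argument order are assumptions made for illustration, and small differences in the last digit relative to the text come from rounding intermediate values.

```python
import math


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio and Wald confidence intervals for a 2x2 table.

    a, b: "yes" and "no" counts of the first group;
    c, d: "yes" and "no" counts of the second group.
    """
    odds_ratio = (a * d) / (b * c)
    log_or = math.log(odds_ratio)
    se = math.sqrt(1.0 / a + 1.0 / b + 1.0 / c + 1.0 / d)
    ci_log = (log_or - z * se, log_or + z * se)
    ci_or = (math.exp(ci_log[0]), math.exp(ci_log[1]))
    return odds_ratio, log_or, se, ci_log, ci_or


# Placebo (130 yes, 6778 no) against cannabinoids (60 yes, 6833 no).
odds_ratio, log_or, se, ci_log, ci_or = odds_ratio_ci(130, 6778, 60, 6833)

print(odds_ratio, log_or, se)
print(ci_log, ci_or)
```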


2.3.5 The Logit Function and Entropy 


SOL-24 UY CH.SOL- 2.21. 


1. The entropy (Fig. 2.13) has a maximum value of $\log_2(2) = 1$ for probability p = 1/2, which
is the most chaotic distribution. A lower entropy indicates a more predictable outcome, with
zero providing full certainty.


2. The derivative of the entropy with respect to p yields the negative of the logit function:


$$\frac{dH(p)}{dp} = -\operatorname{logit}(p). \tag{2.50}$$


Note: The curious reader is encouraged to rigorously prove this claim. 
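
Both claims can be checked numerically before attempting the proof. The sketch below uses natural logarithms (so the maximum is ln 2 rather than 1 bit); the grid resolution, the test point 0.3 and the step size are arbitrary assumptions.

```python
import math


def binary_entropy(p):
    """Entropy (in nats) of a binary outcome with success probability p."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log(p) - (1.0 - p) * math.log(1.0 - p)


def logit(p):
    return math.log(p / (1.0 - p))


# 1. Locate the maximum of H(p) on a grid over (0, 1).
grid = [i / 1000 for i in range(1, 1000)]
p_max = max(grid, key=binary_entropy)

# 2. Compare a central finite difference of H with -logit(p) at one point.
p0, h = 0.3, 1e-6
numeric_slope = (binary_entropy(p0 + h) - binary_entropy(p0 - h)) / (2.0 * h)

print(p_max, binary_entropy(p_max), numeric_slope, -logit(p0))
```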


2.3.6 Python, PyTorch, CPP 


[Plot of the binary entropy; x-axis: Probability (0 to 1).]


FIGURE 2.13: Binary entropy.


SOL-25 CH.SOL- 2.22.


1. During inference, the purpose of inner_product is to multiply the vector of logistic
regression coefficients with the vector of the input which we would like to evaluate, e.g., to calculate
the probability and binary class.


2. The line hypothesis(x) > 0.5f is commonly used for the evaluation of binary classification
wherein probability values above 0.5 (i.e., a threshold) are regarded as TRUE whereas
values below 0.5 are regarded as FALSE.


3. The term θ (theta) stands for the logistic regression coefficients which were evaluated
during training.


4. The output is as follows:


inner_product=-0.5
prob=0.377541
hypo=0


FIGURE 2.14: Logistic regression in C++


SOL-26 CH.SOL- 2.23.

Because the second dimension of lin is 7, and the first dimension of data is 3, the
resulting matrix has a shape of torch.Size([3, 7]).


SOL-27 Y CH.SOL- 2.24. 
Ideally, you should be able to recognize these functions immediately upon a request from
the interviewer.


1. A softmax function. 
2. A sigmoid function. 


3. A derivative of a sigmoid function. 


SOL-28 Uy CH.SOL- 2.25. 
The function implemented in Fig. 2.10 is the cross-entropy function.
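
To see how the per-sample cross-entropy behaves over a batch, the sketch below averages it over a few predictions. This is only an illustration: the clipping constant `eps` (which guards against log(0)) and the example predictions are assumptions, not part of Fig. 2.10.

```python
import math


def binary_cross_entropy(y_hat, y, eps=1e-12):
    """Cross-entropy of one prediction y_hat against a binary label y."""
    y_hat = min(max(y_hat, eps), 1.0 - eps)  # keep log() finite
    return -(y * math.log(y_hat) + (1 - y) * math.log(1.0 - y_hat))


def mean_bce(predictions, labels):
    """Average cross-entropy over a batch."""
    losses = [binary_cross_entropy(p, t) for p, t in zip(predictions, labels)]
    return sum(losses) / len(losses)


labels = [1, 0, 1]
loss_good = mean_bce([0.9, 0.1, 0.8], labels)  # confident and correct
loss_bad = mean_bce([0.1, 0.9, 0.2], labels)   # confident and wrong

print(loss_good, loss_bad)
```

Confident but wrong predictions are penalised far more heavily than confident correct ones, which is the property that makes this loss suitable for training a logistic regression classifier.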


SOL-29 Ud CH.SOL- 2.26. 


1. All the methods are variations of the sigmoid function.


2. In Python, approximately 1.797e+308 is the largest possible value for a floating
point variable, and its natural logarithm is approximately 709.78. Hence the exponential
overflows for arguments above this threshold. If you try to execute the
following expression in Python, it will result in inf: np.log(1.8e+308).


3. I would use Ver003 because of its stability. Note: Can you explain why this method is
more stable than the others?
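
The overflow threshold can be observed directly with the standard library. The sketch below is an illustration rather than the book's method: instead of clipping the argument as Ver003 does, `stable_sigmoid` (a hypothetical name) branches on the sign of x so that the exponential is only ever taken of a non-positive number.

```python
import math
import sys


def naive_sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def stable_sigmoid(x):
    """Sigmoid that never exponentiates a large positive number."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)  # x < 0, so this can underflow to 0 but never overflow
    return e / (1.0 + e)


# The natural log of the largest double is roughly 709.78.
log_of_max_float = math.log(sys.float_info.max)

# math.exp(710) exceeds the largest double and raises OverflowError.
try:
    naive_sigmoid(-710.0)
    naive_overflowed = False
except OverflowError:
    naive_overflowed = True

print(log_of_max_float, naive_overflowed, stable_sigmoid(-710.0))
```

Note that np.exp behaves differently from math.exp here: it returns inf with a runtime warning instead of raising, so the NumPy variants in Fig. 2.11 still produce a value, but only the clipped version avoids the warning.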

3 PROBABILISTIC PROGRAMMING & BAYESIAN DL


Anyone who considers arithmetical methods of producing random digits is, of 
course, in a state of sin. 


— John von Neumann (1903-1957) 


3.1 Introduction 


THE Bayesian school of thought has permeated fields such as mechanical
statistics, classical probability, and financial mathematics [13]. In tandem,
programming libraries such as PyMC3 and Stan [11] have emerged and have become
widely adopted by the machine learning community.

This chapter aims to introduce the Bayesian paradigm and apply Bayesian inference
to a variety of problems. In particular, the reader will be introduced to real-life
examples of conditional probability and also discover one of the most important
results in Bayesian statistics: that the family of beta distributions is conjugate to a
binomial likelihood. It should be stressed that Bayesian inference is a subject matter
that students evidently find hard to grasp, since it heavily relies on rigorous
probabilistic interpretations of data. Specifically, several obstacles hamper the prospect
of learning Bayesian statistics:


1. Students typically undergo merely basic introduction to classical probability and 
statistics. Nonetheless, what follows requires a very solid grounding in these 
fields. 


2. Many courses and resources that address Bayesian learning do not cover essen- 
tial concepts. 


3. A strong comprehension of Bayesian methods involves numerical training and 
sophistication levels that go beyond first year calculus. 


Consequently, this chapter may be much harder to understand than other chapters.
Thus, we strongly urge the readers to thoroughly solve the following questions and
verify their grasp of the mathematical concepts on the basis of the solutions [8].


3.2 Problems 


3.2.1 Expectation and Variance 


PRB-30 CH.PRB- 3.1.
Define what is meant by a Bernoulli trial.


PRB-31 CH.PRB- 3.2.
The binomial distribution is often used to model the probability that k out of a group of n
objects bear a specific characteristic. Define what is meant by a binomial random variable.


PRB-32 @ CH.PRB- 3.3. 
What does the following shorthand stand for? 


X ~ Binomial(n, p) (3.1) 


PRB-33 @ CH.PRB- 3.4. 
Find the probability mass function (PMF) of the following random variable: 


X ~ Binomial(n, p) (3.2) 


PRB-34 @ CH.PRB- 3.5. 
Answer the following questions: 


1. Define what is meant by (mathematical) expectation. 


2. Define what is meant by variance. 


3. Derive the expectation and variance of the binomial random variable X ~ Binomial(n, p)
in terms of p and n.


PRB-35 @ CH.PRB- 3.6. 

Proton therapy (PT) is a widely adopted form of treatment for many types of cancer [6]. 
A PT device which was not properly calibrated is used to treat a patient with pancreatic 
cancer (Fig. 3.1). As a result, a PT beam randomly shoots 200 particles independently and 
correctly hits cancerous cells with a probability of 0.1. 

FIGURE 3.1: Histopathology for pancreatic cancer cells.


1. Find the statistical distribution of the number of correct hits on cancerous cells in 
the described experiment. What are the expectation and variance of the corresponding 
random variable? 


2. A radiologist using the device claims he was able to hit exactly 60 cancerous cells. 
How likely is it that he is wrong? 


3.2.2 Conditional Probability 


PRB-36 @ CH.PRB- 3.7. 
Given two events A and B in probability space H, which occur with probabilities P(A) 
and P(B), respectively: 


1. Define the conditional probability of A given B. Mind singular cases.
2. Annotate each part of the conditional probability formula.


3. Draw an instance of a Venn diagram, depicting the intersection of the events A and B.
Assume that A ∪ B = H.


PRB-37 @ CH.PRB- 3.8. 

Bayesian inference amalgamates data information in the likelihood function with known
prior information. This is done by conditioning the prior on the likelihood using the Bayes
formula. Assume two events A and B in probability space H, which occur with probabilities
P(A) and P(B), respectively. Given that A ∪ B = H, state the Bayes formula for this case,
interpret its components and annotate them.


PRB-38 @ CH.PRB- 3.9. 

Define the terms likelihood and log-likelihood of a discrete random variable X given 
a fixed parameter of interest y. Give a practical example of such scenario and derive its 
likelihood and log-likelihood. 


PRB-39 @ CH.PRB- 3.10. 
Define the term prior distribution of a likelihood parameter in the continuous case. 


PRB-40 O CH.PRB- 3.11. 
Show the relationship between the prior, posterior and likelihood probabilities. 


PRB-41 © CH.PRB- 3.12. 
In a Bayesian context, if a first experiment is conducted, and then another experiment is 
followed, what does the posterior become for the next experiment? 


PRB-42 @ CH.PRB- 3.13. 
What is the condition under which two events A and B are said to be statistically 
independent? 


3.2.3 Bayes Rule 


PRB-43 @ CH.PRB- 3.14. 

In an experiment conducted in the field of particle physics (Fig. 3.2), a certain particle 
may be in two distinct equally probable quantum states: integer spin or half-integer spin. 
It is well-known that particles with integer spin are bosons, while particles with half-integer 
spin are fermions [4]. 

FIGURE 3.2: Bosons and fermions: particles with half-integer spin are fermions.


A physicist is observing two such particles, at least one of which is in a half-integer
state. What is the probability that both particles are fermions?


PRB-44 @ CH.PRB- 3.15. 

During pregnancy, the Placenta Chorion Test [1] is commonly used for the diagnosis of 
hereditary diseases (Fig. 3.3). The test has a probability of 0.95 of being correct whether or 
not a hereditary disease is present. 

FIGURE 3.3: Foetal surface of the placenta


It is known that 1% of pregnancies result in hereditary diseases. Calculate the probability 
of a test indicating that a hereditary disease is present. 


PRB-45 @ CH.PRB- 3.16. 

The Dercum disease [3] is an extremely rare disorder of multiple painful tissue growths. 
In a population in which the ratio of females to males is equal, 5% of females and 0.25% of 
males have the Dercum disease (Fig. 3.4). 


FIGURE 3.4: The Dercum disease 


A person is chosen at random and that person has the Dercum disease. Calculate the 
probability that the person is female. 


PRB-46 @ CH.PRB- 3.17. 
There are numerous fraudulent binary options websites scattered around the Internet,
and for every site that shuts down, new ones sprout like mushrooms. A fraudulent AI-based
stock-market prediction algorithm utilized at the New York Stock Exchange (Fig. 3.5)
can correctly predict if a certain binary option [7] shifts states from 0 to 1 or the other way
around, with 85% certainty.


FIGURE 3.5: The New York Stock Exchange. 


A financial engineer has created a portfolio consisting of twice as many state-1 options as
state-0 options. A stock option is selected at random and is determined by said algorithm to
be in the state of 1. What is the probability that the prediction made by the AI is correct?


PRB-47 © CH.PRB- 3.18. 

In an experiment conducted by a hedge fund to determine if monkeys (Fig. 3.6) can
outperform humans in selecting better stock market portfolios, 0.05 of humans and 1 out of
15 monkeys could correctly predict stock market trends.

FIGURE 3.6: Hedge funds and monkeys.


From an equally probable pool of humans and monkeys an “expert” is chosen at ran- 
dom. When tested, that expert was correct in predicting the stock market shift. What is the 
probability that the expert is a human? 


PRB-48 @ CH.PRB- 3.19. 

During the Cold War, the U.S.A. developed a speech-to-text (STT) algorithm that could
theoretically detect the hidden dialects of Russian sleeper agents. These agents (Fig. 3.7)
were trained to speak English in Russia and subsequently sent to the US to gather
intelligence. The FBI was able to apprehend ten such hidden Russian spies [9] and accused them
of being “sleeper” agents.


FIGURE 3.7: Dialect detection.


The algorithm relied on the acoustic properties of the Russian pronunciation of the word
v-o-k-s-a-l, which was borrowed from the English V-a-u-x-h-a-l-l. It was alleged that it is
impossible for Russians to completely hide their accent, and hence when a Russian would
say V-a-u-x-h-a-l-l, the algorithm would yield the text “v-o-k-s-a-l”. To test the algorithm
at a diplomatic gathering where 20% of participants are sleeper agents and the rest
Americans, a data scientist randomly chooses a person and asks him to say V-a-u-x-h-a-l-l. A single
letter is then chosen randomly from the word that was generated by the algorithm, which
is observed to be an “l”. What is the probability that the person is indeed a Russian sleeper
agent?


PRB-49 € CH.PRB- 3.20. 

During World War II, forces on both sides of the war relied on encrypted communications.
The main encryption scheme used by the German military was the Enigma machine
[5], which was employed extensively by Nazi Germany. Statistically, the Enigma machine
sent the symbols X and Z (Fig. 3.8) according to the following probabilities:


$$P(X) = \frac{2}{9} \tag{3.3}$$

$$P(Z) = \frac{7}{9} \tag{3.4}$$


FIGURE 3.8: The Morse telegraph code.


In one incident, the German military sent encoded messages while the British army used 
countermeasures to deliberately tamper with the transmission. Assume that as a result of the 
British countermeasures, an X is erroneously received as a Z (and mutatis mutandis) with a 
probability =. If a recipient in the German military received a Z, what is the probability that
a Z was actually transmitted by the sender?


3.2.4 Maximum Likelihood Estimation 


PRB-50 @ CH.PRB- 3.21. 
What is the likelihood function of the independent identically distributed (i.i.d.) random
variables:
$X_1, \cdots, X_n$ where $X_i \sim \text{Binomial}(n, p)$, $\forall i \in [1, n]$,
and where p is the parameter of interest?


PRB-51 @ CH.PRB- 3.22.
How can we derive the maximum likelihood estimator (MLE) of the i.i.d. samples
$X_1, \cdots, X_n$ introduced in Q. 3.21?


PRB-52 @ CH.PRB- 3.23. 
What is the relationship between the likelihood function and the log-likelihood function? 


PRB-53 @ CH.PRB- 3.24. 
Describe how to analytically find the MLE of a likelihood function.


PRB-54 @ CH.PRB- 3.25. 
What is the term used to describe the first derivative of the log-likelihood function? 


PRB-55 @ CH.PRB- 3.26. 
Define the term Fisher information. 


3.2.5 Fisher Information 

PRB-56 @ CH.PRB- 3.27.

The 2014 West African Ebola (Fig. 3.9) epidemic has become the largest and
fastest-spreading outbreak of the disease in modern history [2], with a death toll far exceeding all
past outbreaks combined. Ebola (named after the Ebola River in Zaire) first emerged in 1976
in Sudan and Zaire and infected over 284 people with a mortality rate of 53%.


FIGURE 3.9: The Ebola virus. 


This rare outbreak underlined the challenge medical teams are facing in containing
epidemics. A junior data scientist at the Centers for Disease Control (CDC) models the possible
spread and containment of the Ebola virus using a numerical simulation. He knows that out
of a population of k humans (the number of trials), x are carriers of the virus (success in
statistical jargon). He believes the sample likelihood of the virus in the population follows a
Binomial distribution:


$$L(\gamma \mid y) = \binom{n}{y} \gamma^{y} (1 - \gamma)^{n - y}, \quad \gamma \in [0, 1], \quad y = 1, 2, \ldots, n \tag{3.5}$$


As the senior researcher in the team, you advise him that his parameter of interest is γ,
the proportion of infected humans in the entire population. The expectation and variance of
the binomial distribution are:


$$E(y \mid \gamma, n) = n\gamma, \qquad V(y \mid \gamma, n) = n\gamma(1 - \gamma) \tag{3.6}$$

Answer the following for the likelihood function of the form $L_x(\gamma)$:


1. Find the log-likelihood function $l_x(\gamma) = \ln L_x(\gamma)$.

2. Find the gradient of $l_x(\gamma)$.

3. Find the Hessian matrix $H(\gamma)$.

4. Find the Fisher information $I(\gamma)$.

5. In a population spanning 10,000 individuals, 300 were infected by Ebola. Find the
MLE for γ and the standard error associated with it.
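
Before working through the derivations, it can help to confirm the binomial model of (3.5) and the moments quoted in (3.6) numerically. The following is a minimal sketch with arbitrary, hypothetical values n = 20 and γ = 0.3; it does not solve any of the items above.

```python
import math


def binomial_pmf(y, n, gamma):
    """Probability of y successes in n trials with success probability gamma."""
    return math.comb(n, y) * gamma**y * (1.0 - gamma) ** (n - y)


n, gamma = 20, 0.3  # hypothetical values, chosen only for illustration
pmf = [binomial_pmf(y, n, gamma) for y in range(n + 1)]

# Moments computed directly from the probability mass function.
mean = sum(y * p for y, p in enumerate(pmf))
variance = sum((y - mean) ** 2 * p for y, p in enumerate(pmf))

print(sum(pmf), mean, variance)
```

The computed mean and variance agree with nγ and nγ(1 − γ) up to floating point error.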


PRB-57 © CH.PRB- 3.28. 

In this question, you are going to derive the Fisher information function for several
distributions. Given a probability density function (PDF) $f(X \mid \gamma)$, you are provided with
the following definitions:


1. The natural logarithm of the PDF $\ln f(X \mid \gamma) = \Phi(X \mid \gamma)$.
2. The first partial derivative $\Phi'(X \mid \gamma)$.
3. The second partial derivative $\Phi''(X \mid \gamma)$.


4. The Fisher Information for a continuous random variable:
$$I(\gamma) = -E_{\gamma}\left[\Phi''(X \mid \gamma)\right]. \tag{3.7}$$


Find the Fisher Information $I(\gamma)$ for the following distributions:
1. The Bernoulli Distribution $X \sim B(1, \gamma)$.


2. The Poisson Distribution $X \sim \text{Poiss}(\theta)$.


PRB-58 CH.PRB- 3.29.


1. True or False: The Fisher Information is used to compute the Cramér-Rao bound on
the variance of any unbiased maximum likelihood estimator.


2. True or False: The Fisher Information matrix is also the Hessian of the symmetrized
KL divergence.


3.2.6 Posterior & prior predictive distributions


PRB-59 CH.PRB- 3.30.
In chapter 3 we discussed the notion of a prior and a posterior distribution.


1. Define the term posterior distribution.


2. Define the term prior predictive distribution.


PRB-60 CH.PRB- 3.31.

Let $y$ be the number of successes in 5 independent trials, where the probability of success
is $\theta$ in each trial. Suppose your prior distribution for $\theta$ is as follows: $P(\theta = 1/2) = 0.25$,
$P(\theta = 1/6) = 0.5$, and $P(\theta = 1/4) = 0.25$.


1. Derive the posterior distribution $p(\theta \mid y)$ after observing $y$.


2. Derive the prior predictive distribution for $y$.


3.2.7 Conjugate priors 


PRB-61 CH.PRB- 3.32.
In chapter 3 we discussed the notion of a prior and a posterior.


1. Define the term conjugate prior.


2. Define the term non-informative prior.


The Beta-Binomial distribution


PRB-62 CH.PRB- 3.33.

The Binomial distribution was discussed extensively in chapter 3. Here, we are going to
show one of the most important results in Bayesian machine learning. Prove that the family
of beta distributions is conjugate to a binomial likelihood, so that if a prior is in that
family then so is the posterior. That is, show that:

$$x \sim \mathrm{Ber}(\gamma), \quad \gamma \sim B(\alpha, \beta) \implies \gamma \mid x \sim B(\alpha', \beta') \qquad (3.8)$$

For instance, for $h$ heads and $t$ tails, the posterior is:

$$B(h + \alpha,\ t + \beta) \qquad (3.9)$$


3.2.8 Bayesian Deep Learning 


PRB-63 CH.PRB- 3.34.

A recently published paper presents a new layer for a new Bayesian neural network
(BNN). The layer behaves as follows. During the feed-forward operation, each of the hidden
neurons $H_n$, $n \in \{1, 2\}$, in the neural network (Fig. 3.10) may or may not fire, independently
of the others, according to a known prior distribution.


FIGURE 3.10: Likelihood in a BNN model.

The chance of firing, $\gamma$, is the same for each hidden neuron. Using the formal definition,
calculate the likelihood function of each of the following cases:


1. The hidden neuron is distributed according to an $X \sim \mathrm{binomial}(n, \gamma)$ random variable
and fires with a probability of $\gamma$. There are 100 neurons and only 20 are fired.


2. The hidden neuron is distributed according to an $X \sim \mathrm{Uniform}(0, \gamma)$ random variable
and fires with a probability of $\gamma$.


PRB-64 CH.PRB- 3.35.
Your colleague, a veteran of the Deep Learning industry, comes up with an idea for
a BNN layer entitled OnOffLayer. He suggests that each neuron will stay on (the other
state is off) following the distribution $f(x) = e^{-x}$ for $x > 0$ and $f(x) = 0$ otherwise
(Fig. 3.11). $X$ indicates the time in seconds the neuron stays on. In a BNN, 200 such
neurons are activated independently in said OnOffLayer. The OnOffLayer is set to off (i.e.
not active) only if at least 150 of the neurons are shut down. Find the probability that
the OnOffLayer will be active for at least 20 seconds without being shut down.


FIGURE 3.11: OnOffLayer in a BNN model.


PRB-65 CH.PRB- 3.36.
A Dropout layer [12] (Fig. 3.12) is commonly used to regularize a neural network model
by randomly setting several outputs (the crossed-out hidden node H) to 0.


FIGURE 3.12: A Dropout layer (simplified form).


For instance, in PyTorch [10], a Dropout layer is declared as follows (Code 3.1):


import torch
import torch.nn as nn
nn.Dropout(0.2)


CODE 3.1: Dropout in PyTorch


Where nn.Dropout(0.2) (Line #3 in Code 3.1) indicates that the probability of zeroing an
element is 0.2.


FIGURE 3.13: A Bayesian Neural Network Model


A new data scientist in your team suggests the following procedure for a Dropout layer
which is based on Bayesian principles. Each of the neurons $\theta_n$ in the neural network in
(Fig. 3.13) may drop (or not) independently of the others, exactly like a Bernoulli trial.

During the training of a neural network, the Dropout layer randomly drops out outputs
of the previous layer, as indicated in (Fig. 3.12). Here, for illustration purposes, both
neurons are dropped, as depicted by the crossed-out hidden nodes $H_n$.

You are interested in the proportion $\theta$ of dropped-out neurons. Assume that the chance of
drop-out, $\theta$, is the same for each neuron (e.g. a uniform prior for $\theta$). Compute the posterior
of $\theta$.


PRB-66 CH.PRB- 3.37.

A new data scientist in your team, who was formerly a quantum physicist, suggests
the following procedure for a Dropout layer entitled QuantumDrop, which is based on
quantum principles and the Maxwell-Boltzmann distribution. In the Maxwell-Boltzmann
distribution, the likelihood of finding a particle with a particular velocity $v$ is provided by:

$$n(v)\,dv = \frac{4\pi N}{V}\left(\frac{m}{2\pi kT}\right)^{3/2} v^{2}\, e^{-\frac{mv^{2}}{2kT}}\,dv \qquad (3.10)$$

FIGURE 3.14: The Maxwell-Boltzmann distribution (curve shown for Helium; horizontal axis: $v$ in m/s, 0 to 5000).

In the suggested QuantumDrop layer (Fig. 3.15), each of the neurons behaves like a molecule
and is distributed according to the Maxwell-Boltzmann distribution and fires only when
the most probable speed is reached. This speed is the velocity associated with the highest
point in the Maxwell distribution (Fig. 3.14). Using calculus, brain power and some mathematical
manipulation, find the most likely value (speed) at which the neuron will fire.


FIGURE 3.15: A QuantumDrop layer.


3.3 Solutions


3.3.1 Expectation and Variance 


SOL-30 CH.SOL- 3.1.
The notion of a Bernoulli trial refers to an experiment with exactly two possible outcomes:
success ($x = 1$) and failure ($x = 0$).


SOL-31 CH.SOL- 3.2.
A binomial random variable $X = k$ represents $k$ successes in $n$ mutually independent
Bernoulli trials.


SOL-32 CH.SOL- 3.3.
The shorthand $X \sim \mathrm{Binomial}(n, p)$ indicates that the random variable $X$ has the binomial
distribution (Fig. 3.16). The positive integer parameter $n$ indicates the number of
Bernoulli trials and the real parameter $p$, $0 \le p \le 1$, holds the probability of success in each of
these trials.


FIGURE 3.16: The binomial distribution ($n = 50$, $p = 0.3$).


SOL-33 CH.SOL- 3.4.
The random variable $X \sim \mathrm{Binomial}(n, p)$ has the following PMF:

$$P(X = k) = \binom{n}{k} p^{k} (1 - p)^{n-k}, \quad k = 0, 1, 2, \ldots, n. \qquad (3.11)$$


SOL-34 CH.SOL- 3.5.
The answers below regard a discrete random variable. The curious reader is encouraged to
extend them to the continuous case.


1. For a random variable $X$ with probability mass function $P(X = k)$ and a set of outcomes
$K$, the expected value of $X$ is defined as:

$$E[X] := \sum_{k \in K} k\,P(X = k). \qquad (3.12)$$

Note: The expectation of $X$ may also be denoted by $\mu_X$.

2. The variance of $X$ is defined as:

$$\mathrm{Var}[X] := E\left[(X - E[X])^{2}\right]. \qquad (3.13)$$

Note: The variance of $X$ may also be denoted by $\sigma_X^{2}$, while $\sigma_X$ itself denotes the
standard deviation of $X$.

3. The population mean and variance of a binomial random variable with parameters $n$
and $p$ are:

$$E[X] = np, \qquad V[X] = np(1 - p) \qquad (3.14)$$

Note: Why is this solution intuitive? What information theory-related phenomenon
occurs when $p = 1/2$?
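
The definitions in eqs. (3.12) and (3.13) can be checked numerically against the closed forms of
eq. (3.14). The following is a minimal sketch, not part of the original solution; the function
names are illustrative and the parameters $n = 50$, $p = 0.3$ are borrowed from Fig. 3.16.

```python
from math import comb


def binomial_pmf(k: int, n: int, p: float) -> float:
    """P(X = k) for X ~ Binomial(n, p), following eq. (3.11)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def mean_and_variance(n: int, p: float) -> tuple[float, float]:
    """Evaluate eqs. (3.12) and (3.13) by summing over all outcomes k = 0..n."""
    pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]
    mean = sum(k * pk for k, pk in enumerate(pmf))
    var = sum((k - mean) ** 2 * pk for k, pk in enumerate(pmf))
    return mean, var


# Illustrative parameters (assumed here, taken from Fig. 3.16).
mean, var = mean_and_variance(n=50, p=0.3)
print(mean, var)  # should agree with np and np(1 - p) up to rounding
```

Summing over the full support is only practical for small $n$; the closed forms in eq. (3.14)
are what one would use otherwise.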


SOL-35 CH.SOL- 3.6.


1. This scenario describes an experiment that is repeated 200 times independently with a
success probability of 0.1. Thus, if the random variable $X$ denotes the number of times
success was obtained, then it is best characterized by the binomial distribution with
parameters $n = 200$ and $p = 0.1$. Formally:

$$X \sim \mathrm{Binomial}(200, 0.1). \qquad (3.15)$$

The expectation of $X$ is given by:

$$\bar{x} = E(X) = 200 \times 0.1 = 20, \qquad (3.16)$$

and its respective variance is:

$$\mathrm{Var} = 200 \times 0.10\,(1 - 0.10) = 18.0. \qquad (3.17)$$

2. Here we propose two distinct methods to answer the question.

First, the straightforward solution is to employ the definition of the binomial
distribution and substitute the value of $X$ in it. Namely:

$$P(X = 60;\ n = 200,\ p = 0.1) = \binom{200}{60}\, 0.1^{60}\, (1 - 0.1)^{200-60} \approx 2.7 \times 10^{-15}. \qquad (3.18)$$

This leads to an extremely high probability that the radiologist is mistaken.

The second approach is longer and more advanced, but grants the reader insight
and intuition regarding the result. To derive how wrong the radiologist is, we can
employ an approximation by considering the standard normal distribution. In statistics,
the Z-score tells us how far a data point is from the mean, in units of
standard deviation, thus revealing how likely it is to occur (Fig. 3.17):

$$Z = \frac{x - \mu}{\sigma} \qquad (3.19)$$

FIGURE 3.17: Z-score (data point $x$, mean $\mu$, standard deviation $\sigma$).

Therefore, the probability of correctly hitting 60 cells is:

$$P(X \ge 60) = P\left(Z \ge \frac{60 - 20}{\sqrt{18.0}}\right) = P(Z \ge 9.428) \approx 0. \qquad (3.20)$$

Again, the outcome shows that the likelihood that the radiologist was wrong approaches 1.
Note: Why does the relation depicted in Fig. 3.17 imply that Z is a standard Gaussian?
Under what conditions is this conclusion valid? Why does eq. (3.20) employ the cumulative
distribution function and not the probability mass function?
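
Both approaches above can be reproduced in a few lines. The sketch below is an illustration
rather than the author's code: it evaluates the exact binomial term of eq. (3.18) and the normal
tail of eq. (3.20), expressing the Gaussian tail through the standard-library function
`math.erfc` (the variable names are assumptions made for this example).

```python
from math import comb, erfc, sqrt

n, p, k = 200, 0.1, 60

# Approach 1: exact binomial probability of exactly k successes, eq. (3.18).
exact_pmf = comb(n, k) * p**k * (1 - p) ** (n - k)

# Approach 2: Z-score and the upper tail of the standard normal, eqs. (3.19)-(3.20).
mu = n * p                       # expectation, eq. (3.16)
sigma = sqrt(n * p * (1 - p))    # square root of the variance, eq. (3.17)
z = (k - mu) / sigma
normal_tail = 0.5 * erfc(z / sqrt(2))  # P(Z >= z) for a standard Gaussian

print(f"exact P(X = {k}) = {exact_pmf:.3e}")
print(f"z = {z:.3f}, P(Z >= z) = {normal_tail:.3e}")
```

Both numbers are vanishingly small, which is all the argument needs; the normal tail is only
an approximation this far from the mean and should not be read as an accurate estimate.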


3.3.2 Conditional Probability 


SOL-36 CH.SOL- 3.7.


1. For two events $A$ and $B$ with $P(B) > 0$, the conditional probability of $A$ given that
$B$ has occurred is defined as:

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)} \qquad (3.21)$$

It is easy to note that if $P(B) = 0$, this relation is not defined mathematically: since
$P(A \cap B) \le P(B) = 0$, the ratio becomes $0/0$.


2. The annotated probabilities are displayed in Fig. 3.18:

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)} \qquad (3.22)$$

FIGURE 3.18: Conditional probability ($A$ given $B$).


3. An example of a diagram depicting the intersected events $A$ and $B$ is displayed in Fig. 3.19.


FIGURE 3.19: Venn diagram of the intersected events A and B in probability space H


SOL-37 CH.SOL- 3.8.
The Bayes formula reads:

$$P(A \mid B) = \frac{P(B \mid A)P(A)}{P(B)} = \frac{P(B \mid A)P(A)}{P(B \mid A)P(A) + P(B \mid A^{c})P(A^{c})}, \qquad (3.23)$$

where $P(A^{c})$ is the complementary probability of $P(A)$. The interpretation of the elements in
the Bayes formula is as follows:

$$\text{posterior probability} = \frac{\text{likelihood of the data} \times \text{prior probability}}{\text{normalization constant}} \qquad (3.24)$$

Note: What is the role of the normalization constant? Analyze the cases where
$P(B) \to 0$ and $P(B) \to 1$. The annotated probabilities are displayed in (Fig. 3.20):

$$P(A \mid B) = \frac{P(B \mid A)P(A)}{P(B)} \qquad (3.25)$$

FIGURE 3.20: Annotated components of the Bayes formula (eq. 3.23): likelihood $P(B \mid A)$, prior $P(A)$, normalization constant $P(B)$, posterior $P(A \mid B)$.


SOL-38 CH.SOL- 3.9.
Given $X$ as a discrete random variable and given $\gamma$ as the parameter of
interest, the likelihood and the log-likelihood of $X$ given $\gamma$ are, respectively:

$$L_{\gamma}(X = x) = p(X = x \mid \gamma) \qquad (3.26)$$

$$\ell_{\gamma}(X = x) = \ln\left(p(X = x \mid \gamma)\right) \qquad (3.27)$$

The term likelihood can be intuitively understood from this definition; it expresses how likely it is
to obtain a value $x$ when prior information is given regarding its distribution, namely the
parameter $\gamma$. For example, let us consider a biased coin toss with $p_h = \gamma$. Then:

$$L_{\gamma}(X = \text{“h”}) = p(X = \text{“h”} \mid \gamma) = \gamma. \qquad (3.28)$$

$$\ell_{\gamma}(X = \text{“h”}) = \ln\left(p(X = \text{“h”} \mid \gamma)\right) = \ln(\gamma). \qquad (3.29)$$

Note: The likelihood function may also follow continuous distributions such as the normal
distribution. In the latter, it is recommended and often obligatory to employ the log-likelihood.
Why? We encourage the reader to modify the above to the continuous case of the normal
distribution and derive the answer.
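
One practical reason for preferring the log-likelihood can be seen even in the discrete coin
example above. The sketch below is an illustration under assumed data (3000 heads and 1000
tails, chosen only for the demonstration): the product of many probabilities underflows in
floating point, whereas the sum of their logarithms remains well behaved.

```python
import math


def likelihood(samples: list[int], gamma: float) -> float:
    """Product of per-toss probabilities, as in eq. (3.28); 1 encodes heads."""
    product = 1.0
    for x in samples:
        product *= gamma if x == 1 else 1.0 - gamma
    return product


def log_likelihood(samples: list[int], gamma: float) -> float:
    """Sum of per-toss log-probabilities, as in eq. (3.29)."""
    return sum(math.log(gamma if x == 1 else 1.0 - gamma) for x in samples)


# Hypothetical data set: 3000 heads and 1000 tails.
samples = [1, 0, 1, 1] * 1000
gamma = 0.8

print(likelihood(samples, gamma))      # underflows to 0.0 in double precision
print(log_likelihood(samples, gamma))  # a finite, usable number
```

Because $\ln$ is monotonically increasing, maximizing the second quantity yields the same
estimate as maximizing the first.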


SOL-39 CH.SOL- 3.10.

The continuous prior distribution, $f(\Gamma = \gamma)$, represents what is known about the probability
of the value $\gamma$ before the experiment has commenced. It is termed as being subjective,
and therefore may vary considerably between researchers. Continuing the previous example,
$f(\Gamma = 0.8)$ holds the probability of randomly flipping a coin that yields “heads” 80% of
the time.


SOL-40 CH.SOL- 3.11.

The essence of Bayesian analysis is to draw inference of unknown quantities or quantiles
from the posterior distribution $p(\Gamma = \gamma \mid X = x)$, which is traditionally derived from prior
beliefs and data information. Bayesian statistical conclusions about the chances of obtaining the
parameter $\Gamma = \gamma$, or unobserved values of the random variable $X = x$, are made in terms of
probability statements. These probability statements are conditional on the observed values of $X$,
which is denoted as $p(\Gamma = \gamma \mid X = x)$, called the posterior distribution of the parameter $\gamma$.
Bayesian analysis is a practical method for making inferences from data and prior beliefs using
probability models for quantities we observe and for quantities which we wish to learn. Bayes rule
provides a relationship of this form:

$$\text{posterior} \propto p(x \mid \gamma)\,p(\gamma) \propto \text{data given prior} \times \text{chance of prior}. \qquad (3.30)$$


SOL-41 CH.SOL- 3.12.

The posterior density summarizes what is known about the parameter of interest $\gamma$ after
the data is observed. In Bayesian statistics, the posterior density $p(\Gamma = \gamma \mid X = x)$ becomes
the prior for the next experiment. This is part of the well-known Bayesian updating mechanism
wherein we update our knowledge to reflect the actual distribution of data that we
observed. To summarize, from the perspective of Bayes Theorem, we update the prior distribution
to a posterior distribution after seeing the data.


SOL-42 CH.SOL- 3.13.
Two events $A$ and $B$ are statistically independent if (and only if):

$$P(A \cap B) = P(A)P(B). \qquad (3.31)$$

Note: Use conditional probability and rationalize this outcome. How does this property be-
come extremely useful in practical researches that consider likelihood of normally distributed

3.3.3 Bayes Rule 


SOL-43 CH.SOL- 3.14.
Let $\gamma$ stand for the number of half-integer spin states, and given the prior knowledge that
both states are equally probable:

$$P(\gamma = 2 \mid \gamma \ge 1) \qquad (3.32)$$
$$= \frac{P(\gamma = 2,\ \gamma \ge 1)}{P(\gamma \ge 1)} \qquad (3.33)$$
$$= \frac{P(\gamma = 2)}{1 - P(\gamma = 0)} = \frac{1/4}{1 - 1/4} = \frac{1}{3} \qquad (3.34)$$

Note: Under what statistical property do the above relations hold?


SOL-44 CH.SOL- 3.15.

Let event $A$ indicate that hereditary-disease is present, and let event $B$ indicate a positive test
result. The calculated probabilities are presented in Table 3.1. We were asked to find the probability
of a test indicating that hereditary-disease is present, namely $P(B)$. According to the law of
total probability:

$$P(B) = P(B \mid A)\,P(A) + P(B \mid \bar{A})\,P(\bar{A}) = [0.95 \times 0.01] + [0.05 \times 0.99] = 0.059 \qquad (3.35)$$

PROBABILITY                                  EXPLANATION
$P(A) = 0.01$                                The probability of hereditary-disease.
$P(\bar{A}) = 1 - 0.01 = 0.99$               The probability of no hereditary-disease.
$P(\bar{B} \mid \bar{A}) = 0.95$             The probability that the test will yield a negative result [$\bar{B}$] if hereditary-disease is NOT present [$\bar{A}$].
$P(B \mid \bar{A}) = 1 - 0.95 = 0.05$        The probability that the test will yield a positive result [$B$] if hereditary-disease is NOT present [$\bar{A}$] (probability of false alarm).
$P(B \mid A) = 0.95$                         The probability that the test will yield a positive result [$B$] if hereditary-disease is present [$A$] (probability of detection).
$P(\bar{B} \mid A) = 1 - 0.95 = 0.05$        The probability that the test will yield a negative result [$\bar{B}$] if hereditary-disease is present [$A$].

TABLE 3.1: Probability values of hereditary-disease detection.

Note: In terms of performance evaluation, $P(B \mid A)$ is often referred to as the probability of
detection and $P(B \mid \bar{A})$ is considered the probability of false alarm. Notice that these measures
do not, either logically or mathematically, sum to a probability of 1.
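
The law of total probability used in eq. (3.35) is easy to mechanize. The following sketch
plugs in the values of Table 3.1; the variable names are illustrative, and the final line goes
one step beyond the question asked by also applying Bayes rule to obtain $P(A \mid B)$, which
is an addition made here for context.

```python
# Values taken from Table 3.1.
p_disease = 0.01              # P(A)
p_pos_given_disease = 0.95    # P(B | A), probability of detection
p_pos_given_healthy = 0.05    # P(B | not A), probability of false alarm

# Law of total probability, eq. (3.35).
p_positive = (
    p_pos_given_disease * p_disease
    + p_pos_given_healthy * (1.0 - p_disease)
)

# Extra step (not asked in the problem): Bayes rule for P(A | B).
p_disease_given_positive = p_pos_given_disease * p_disease / p_positive

print(round(p_positive, 4))
print(round(p_disease_given_positive, 4))
```

Most positive results in this setting come from the healthy majority, which is why the
posterior probability of disease stays far below the detection rate.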


SOL-45 CH.SOL- 3.16.
We first enumerate the probabilities one by one:

$$P(\text{Dercum} \mid \text{female}) = 0.05, \qquad (3.36)$$
$$P(\text{Dercum} \mid \text{male}) = 0.0025, \qquad (3.37)$$
$$P(\text{male}) = P(\text{female}) = 0.5. \qquad (3.38)$$

We are asked to find $P(\text{female} \mid \text{Dercum})$. Using Bayes Rule:

$$P(\text{female} \mid \text{Dercum}) = \frac{P(\text{Dercum} \mid \text{female})\,P(\text{female})}{P(\text{Dercum})} \qquad (3.39)$$

However, we are missing the term $P(\text{Dercum})$. To find it, we apply the Law of Total
Probability:

$$P(\text{Dercum}) = P(\text{Dercum} \mid \text{female})\,P(\text{female}) + P(\text{Dercum} \mid \text{male})\,P(\text{male}) = 0.05 \cdot 0.5 + 0.0025 \cdot 0.5 = 0.02625.$$

And finally, returning to eq. (3.39):

$$P(\text{female} \mid \text{Dercum}) = \frac{0.05 \cdot 0.5}{0.02625} \approx 0.9524 \qquad (3.40)$$

Note: How could this result be reached with one mathematical equation?


SOL-46 CH.SOL- 3.17.
In order to solve this problem, we introduce the following events:


1. AI: the AI predicts that the state of the stock option is 1.
2. State1: the state of the stock option is 1.
3. State0: the state of the stock option is 0.

A direct application of the Bayes formula yields:

$$P(\text{State1} \mid \text{AI}) = \frac{P(\text{AI} \mid \text{State1})\,P(\text{State1})}{P(\text{AI} \mid \text{State1})\,P(\text{State1}) + P(\text{AI} \mid \text{State0})\,P(\text{State0})} \qquad (3.41)$$

$$= \frac{0.85 \cdot 2/3}{0.85 \cdot 2/3 + 0.15 \cdot 1/3} \approx 0.9189. \qquad (3.42)$$


SOL-47 CH.SOL- 3.18.
In order to solve this problem, we introduce the following events:


1. H: a human.
2. M: a monkey.
3. C: a correct prediction.


By employing Bayes theorem and the Law of Total probability:


P(C|H)P(H) + P(C|M)P(M) (3.43) 


Note: If something seems off in this outcome, do not worry - it is a positive sign for
understanding of conditional probability.


SOL-48 CH.SOL- 3.19.
In order to solve this problem, we introduce the following events:


1. RUS: a Russian sleeper agent is speaking.
2. AM: an American is speaking.
3. L: the TTS system generates an “l”.


We are asked to find the value of $P(\text{RUS} \mid L)$. Using Bayes Theorem we can write:

$$P(\text{RUS} \mid L) = \frac{P(L \mid \text{RUS})\,P(\text{RUS})}{P(L)}. \qquad (3.44)$$

We were told that the Russians make up 1/5 of the attendees at the gathering, therefore:

$$P(\text{RUS}) = \frac{1}{5}. \qquad (3.45)$$

Additionally, because “v-o-k-s-a-l” has a single l out of a total of six letters:

$$P(L \mid \text{RUS}) = \frac{1}{6}. \qquad (3.46)$$

Additionally, because “V-a-u-x-h-a-l-l” has two l's out of a total of eight letters:

$$P(L \mid \text{AM}) = \frac{2}{8}. \qquad (3.47)$$

An application of the Law of Total Probability yields:

$$P(L) = P(\text{AM})\,P(L \mid \text{AM}) + P(\text{RUS})\,P(L \mid \text{RUS}) = \left(\frac{4}{5}\right)\left(\frac{2}{8}\right) + \left(\frac{1}{5}\right)\left(\frac{1}{6}\right) = \frac{7}{30}. \qquad (3.48)$$

Using Bayes Theorem we can write:

$$P(\text{RUS} \mid L) = \frac{\left(\frac{1}{5}\right)\left(\frac{1}{6}\right)}{\frac{7}{30}} = \frac{1}{7}. \qquad (3.49)$$

Note: What is the letter by which the algorithm is most likely to discover a Russian sleeper
agent?
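
Because every quantity in this solution is a ratio of small integers, the computation can be
checked exactly with rational arithmetic. The following is a minimal sketch using Python's
standard `fractions` module; the variable names are illustrative and not part of the original
solution.

```python
from fractions import Fraction

# Quantities stated in the solution, eqs. (3.45)-(3.47).
p_rus = Fraction(1, 5)            # P(RUS)
p_am = 1 - p_rus                  # P(AM)
p_l_given_rus = Fraction(1, 6)    # one "l" among the six letters of "voksal"
p_l_given_am = Fraction(2, 8)     # two "l"s among the eight letters of "Vauxhall"

# Law of Total Probability, eq. (3.48).
p_l = p_am * p_l_given_am + p_rus * p_l_given_rus

# Bayes Theorem, eq. (3.49).
p_rus_given_l = p_rus * p_l_given_rus / p_l

print(p_l, p_rus_given_l)
```

Working with `Fraction` avoids rounding noise and makes it easy to compare the result with the
hand-derived fractions.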


SOL-49 CH.SOL- 3.20.
We are given that:
P(X is erroneously received as a Z) = 1/7. Using Bayes Theorem we can write:

$$P(Z\ \text{trans} \mid Z\ \text{received}) = \frac{P(Z\ \text{received} \mid Z\ \text{trans})\,P(Z\ \text{trans})}{P(Z\ \text{received})} \qquad (3.50)$$

An application of the Law of Total Probability yields:

$$P(Z\ \text{received}) = P(Z\ \text{received} \mid Z\ \text{trans})\,P(Z\ \text{trans}) + P(Z\ \text{received} \mid X\ \text{trans})\,P(X\ \text{trans})$$


|l 
TRE 
SIA `. 


OIN 
+ 
=P. 
Ol] bh 


So, using Bayes Rule, we have that

$$P(Z\ \text{trans} \mid Z\ \text{received}) = \frac{P(Z\ \text{received} \mid Z\ \text{trans})\,P(Z\ \text{trans})}{P(Z\ \text{received})} \qquad (3.51)$$


3.3.4 Maximum Likelihood Estimation 


SOL-50 CH.SOL- 3.21.
For the set of i.i.d. samples $X_1, \cdots, X_n$, the likelihood function is the product of the
probability functions:


n 3.52 
=i ("ena =p)". C2) 


SOL-51 CH.SOL- 3.22.
The maximum likelihood estimator (MLE) of $p$ is the value, among all possible $p$ values, that
maximizes $L(p)$. Namely, the $p$ value that renders the set of measurements $X_1, \cdots, X_n$ the
most likely. Formally:

$$\hat{p} = \arg\max_{0 \le p \le 1} L(p) \qquad (3.53)$$

Note: The curious student is highly encouraged to derive $\hat{p}$ from $L(p)$. Notice that $L(p)$ can
be considerably simplified.


SOL-52 CH.SOL- 3.23.

The log-likelihood is the logarithm of the likelihood function. Intuitively, maximizing
the likelihood function $L(\gamma)$ is equivalent to maximizing $\ln L(\gamma)$ in terms of finding the MLE
$\hat{\gamma}$, since $\ln$ is a monotonically increasing function. Often, we maximize $\ln(f(\gamma))$ instead of
$f(\gamma)$. A common example is when $L(\gamma)$ is comprised of normally distributed random
variables.

Formally, if $X_1, \cdots, X_n$ are i.i.d., each with probability mass function (PMF) $f_{X_i}(x_i \mid \gamma)$,
then

$$f(\gamma) = \prod_{i=1}^{n} f_{X_i}(x_i \mid \gamma), \qquad (3.54)$$

$$\ln f(\gamma) = \sum_{i=1}^{n} \ln f_{X_i}(x_i \mid \gamma). \qquad (3.55)$$

SOL-53 CH.SOL- 3.24.
The general procedure for finding the MLE, given that the likelihood function is differentiable,
is as follows:


1. Start by differentiating the log-likelihood function $\ln(L(\gamma))$ with respect to a parameter
of interest $\gamma$.

2. Equate the result to zero.

3. Solve the equation to find the $\hat{\gamma}$ that satisfies:

$$\frac{\partial \ln L(\hat{\gamma} \mid x_1, \cdots, x_n)}{\partial \gamma} = 0 \qquad (3.56)$$

4. Compute the second derivative to verify that you indeed have a maximum rather than
a minimum.


SOL-54 CH.SOL- 3.25.
The first derivative of the log-likelihood function is commonly known as the Fisher score
function, and is defined as:

$$u(\gamma) = \frac{\partial \ln L(\gamma \mid x_1, \cdots, x_n)}{\partial \gamma} \qquad (3.57)$$


SOL-55 CH.SOL- 3.26.
Fisher information is the term used to describe the expected value of the second derivative
(the curvature) of the log-likelihood function, with a negative sign, and is defined by:

$$I(\gamma) = -E\left[\frac{\partial^{2} \ln L(\gamma \mid x_1, \cdots, x_n)}{\partial \gamma^{2}}\right] \qquad (3.58)$$


3.3.5 Fisher Information 


SOL-56 CH.SOL- 3.27.
1. Given $L(\gamma)$:

$$\ln L(\gamma) = \ln\binom{n}{y} + y\ln(\gamma) + (n - y)\ln(1 - \gamma). \qquad (3.59)$$

2. To find the gradient, we differentiate once:

$$g(\gamma) = \frac{y}{\gamma} - \frac{n - y}{1 - \gamma} = \frac{y - n\gamma}{\gamma(1 - \gamma)}. \qquad (3.60)$$

3. The Hessian is generated by differentiating $g(\gamma)$:

$$H(\gamma) = -y\gamma^{-2} - (n - y)(1 - \gamma)^{-2}. \qquad (3.61)$$

4. The Fisher information is calculated as follows:

$$I(\gamma) = -E\left[H(\gamma)\right] = \frac{n}{\gamma(1 - \gamma)}, \qquad (3.62)$$

since:

$$E(y \mid \gamma, n) = n\gamma. \qquad (3.63)$$

5. Equating the gradient to zero and solving for our parameter $\gamma$, we get:

$$\hat{\gamma} = \frac{y}{n}. \qquad (3.64)$$

In our case this equates to: 300/10000 = 0.03. Regarding the error, there is a close
relationship between the variance of $\hat{\gamma}$ and the Fisher information, as the former is the
inverse of the latter:

$$V(\hat{\gamma}) = \frac{\gamma(1 - \gamma)}{n}. \qquad (3.65)$$

Plugging the numbers from our question:

$$V(\hat{\gamma}) = \frac{0.03\,(1 - 0.03)}{10000} \approx 2.9 \times 10^{-6}. \qquad (3.66)$$

Statistically, the standard error that we are asked to find is the square root of eq. 3.66,
which equals approximately $1.7 \times 10^{-3}$. Note: What desired property is revealed in this
experiment? At what cost could we ensure a low standard error?
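
The chain from eq. (3.62) to eq. (3.66) can be verified numerically. The sketch below is an
illustration, not the author's code: the helper `binomial_mle` is a hypothetical name, and it
evaluates the Fisher information at the estimate $\hat{\gamma}$, which is the usual plug-in
assumption.

```python
import math


def binomial_mle(y: int, n: int) -> tuple[float, float, float]:
    """Return (MLE, variance, standard error) for y successes in n trials."""
    gamma_hat = y / n                                  # eq. (3.64)
    fisher_info = n / (gamma_hat * (1.0 - gamma_hat))  # eq. (3.62) at gamma_hat
    variance = 1.0 / fisher_info                       # eq. (3.65)
    return gamma_hat, variance, math.sqrt(variance)


gamma_hat, variance, std_error = binomial_mle(y=300, n=10_000)
print(f"MLE = {gamma_hat}, variance = {variance:.3e}, standard error = {std_error:.3e}")
```

Evaluating the same function with a larger `n` at the same proportion shows the standard
error shrinking like $1/\sqrt{n}$, which hints at the cost of a tighter estimate.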


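SOL-57 CH.SOL- 3.28.
The Fisher Information for the distributions is as follows:


1. Bernoulli:

$$\Phi(x \mid \gamma) = x\log\gamma + (1 - x)\log(1 - \gamma), \qquad (3.67)$$

$$\Phi'(x \mid \gamma) = \frac{x}{\gamma} - \frac{1 - x}{1 - \gamma}, \qquad (3.68)$$

$$\Phi''(x \mid \gamma) = -\frac{x}{\gamma^{2}} - \frac{1 - x}{(1 - \gamma)^{2}}, \qquad (3.69)$$

$$I(\gamma) = -E_{\gamma}\left[\Phi''(x \mid \gamma)\right] = \frac{\gamma}{\gamma^{2}} + \frac{1 - \gamma}{(1 - \gamma)^{2}} = \frac{1}{\gamma(1 - \gamma)}. \qquad (3.70)$$

2. Poisson:

$$\Phi(x \mid \theta) = x\log\theta - \log x! - \theta, \qquad \Phi''(x \mid \theta) = -\frac{x}{\theta^{2}}, \qquad I(\theta) = \frac{1}{\theta}. \qquad (3.71)$$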
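
The definition $I = -E[\Phi'']$ from eq. (3.7) can be evaluated directly for both
distributions by summing over the outcomes. The following is a rough numerical sketch with
illustrative function names; for the Poisson case the infinite sum is truncated, which is an
assumption that is harmless for moderate values of $\theta$.

```python
import math


def fisher_bernoulli(gamma: float) -> float:
    """I(gamma) = -E[Phi''(x | gamma)], summed over x in {0, 1}."""

    def second_derivative(x: int) -> float:
        return -x / gamma**2 - (1 - x) / (1 - gamma) ** 2

    pmf = {0: 1.0 - gamma, 1: gamma}
    return -sum(prob * second_derivative(x) for x, prob in pmf.items())


def fisher_poisson(theta: float, terms: int = 100) -> float:
    """I(theta) = -E[Phi''(x | theta)] with Phi'' = -x / theta**2.

    The sum over x = 0, 1, 2, ... is truncated after `terms` terms.
    """
    total = 0.0
    prob = math.exp(-theta)          # P(X = 0)
    for x in range(terms):
        total += prob * x / theta**2
        prob *= theta / (x + 1)      # P(X = x + 1) obtained from P(X = x)
    return total


print(fisher_bernoulli(0.3), 1 / (0.3 * 0.7))
print(fisher_poisson(3.0), 1 / 3.0)
```

Each printed pair should agree closely, matching the closed forms $1/(\gamma(1-\gamma))$ and
$1/\theta$.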


SOL-58 CH.SOL- 3.29.
1. True.
2. True.


3.3.6 Posterior & prior predictive distributions 


SOL-59 CH.SOL- 3.30.


1. Consider a sample of the form $x = (x_1, \cdots, x_n)$ drawn from a density $p(\theta; x)$, where $\theta$ is
randomly generated according to a prior density $p(\theta)$. Then the posterior density is
defined by:

$$p(\theta \mid x) = \frac{p(\theta; x)\,p(\theta)}{p(x)} \qquad (3.72)$$

2. The prior predictive density is:

$$p(x) = \int_{\theta \in \Theta} p(\theta; x)\,p(\theta)\,d\theta \qquad (3.73)$$


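SOL-60 CH.SOL- 3.31.


1. The posterior $p(\theta \mid y) \propto p(y \mid \theta)\,p(\theta)$ is:

$$p(\theta \mid y) \propto \begin{cases} \binom{5}{y}(1/2)^{y}(1/2)^{5-y}\,0.25, & \theta = 1/2 \\ \binom{5}{y}(1/6)^{y}(5/6)^{5-y}\,0.5, & \theta = 1/6 \\ \binom{5}{y}(1/4)^{y}(3/4)^{5-y}\,0.25, & \theta = 1/4 \\ 0, & \text{otherwise} \end{cases} \qquad (3.74)$$

2. The prior predictive distribution $p(y)$:

$$p(y) = \binom{5}{y}\left((1/2)^{y}(1/2)^{5-y}\,0.25 + (1/6)^{y}(5/6)^{5-y}\,0.5 + (1/4)^{y}(3/4)^{5-y}\,0.25\right). \qquad (3.75)$$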
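
Since the prior has only three support points, both results can be tabulated exactly. The
sketch below is illustrative (the dictionary layout and function names are assumptions): it
normalizes the posterior of the first part by the prior predictive of the second part, using
rational arithmetic so that the probabilities sum to exactly one.

```python
from fractions import Fraction
from math import comb

N_TRIALS = 5

# Discrete prior from the problem statement: theta -> P(theta).
prior = {
    Fraction(1, 2): Fraction(1, 4),
    Fraction(1, 6): Fraction(1, 2),
    Fraction(1, 4): Fraction(1, 4),
}


def likelihood(y: int, theta: Fraction) -> Fraction:
    """Binomial likelihood p(y | theta) for y successes in 5 trials."""
    return comb(N_TRIALS, y) * theta**y * (1 - theta) ** (N_TRIALS - y)


def prior_predictive(y: int) -> Fraction:
    """p(y): the likelihood averaged over the prior."""
    return sum(likelihood(y, theta) * weight for theta, weight in prior.items())


def posterior(y: int) -> dict[Fraction, Fraction]:
    """p(theta | y): likelihood times prior, normalized by p(y)."""
    evidence = prior_predictive(y)
    return {
        theta: likelihood(y, theta) * weight / evidence
        for theta, weight in prior.items()
    }


for theta, prob in posterior(2).items():
    print(f"P(theta = {theta} | y = 2) = {float(prob):.4f}")
```

The same functions can be called with any $y$ between 0 and 5 to see how the observation
shifts mass between the three candidate values of $\theta$.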


3.3.7 Conjugate priors 


SOL-61 CH.SOL- 3.32.


1. A class $\mathcal{F}$ of prior distributions is said to form a conjugate family if the posterior density
is in $\mathcal{F}$ for every sample, whenever the prior density is in $\mathcal{F}$.


2. Often we would like a prior that favours no particular values of the parameter over
others. Bayesian analysis requires prior information, however sometimes there is no
particularly useful information before data is collected. In these situations, priors with
“no information” are expected. Such priors are called non-informative priors.


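SOL-62 CH.SOL- 3.33.


If $x \sim B(n, \gamma)$, so that

$$p(x \mid \gamma) \propto \gamma^{x}(1 - \gamma)^{n-x},$$

and the prior for $\gamma$ is $B(\alpha, \beta)$, so that

$$p(\gamma) \propto \gamma^{\alpha-1}(1 - \gamma)^{\beta-1},$$

then multiplying the two gives $p(\gamma \mid x) \propto \gamma^{x+\alpha-1}(1 - \gamma)^{n-x+\beta-1}$, and the posterior is

$$\gamma \mid x \sim B(\alpha + x,\ \beta + n - x).$$

It is immediately clear that the family of beta distributions is conjugate to a
binomial likelihood.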
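
In practice the conjugacy result reduces Bayesian updating to adding counts to the prior
parameters. The sketch below illustrates this with hypothetical helper names and assumed
numbers ($\alpha = 2$, $\beta = 3$, 7 successes in 10 trials); it also checks pointwise that
likelihood times prior has the same shape as the updated beta kernel.

```python
import math


def beta_binomial_update(alpha: float, beta: float, x: int, n: int) -> tuple[float, float]:
    """Posterior Beta parameters after observing x successes in n trials."""
    return alpha + x, beta + n - x


def beta_kernel(gamma: float, a: float, b: float) -> float:
    """Unnormalized Beta(a, b) density at gamma."""
    return gamma ** (a - 1) * (1 - gamma) ** (b - 1)


def likelihood_times_prior(gamma: float, alpha: float, beta: float, x: int, n: int) -> float:
    """Unnormalized posterior: binomial kernel times the Beta(alpha, beta) kernel."""
    return gamma**x * (1 - gamma) ** (n - x) * beta_kernel(gamma, alpha, beta)


# Assumed example values.
alpha, beta, x, n = 2.0, 3.0, 7, 10
post_a, post_b = beta_binomial_update(alpha, beta, x, n)
print(post_a, post_b)

for g in (0.1, 0.5, 0.9):
    print(g, likelihood_times_prior(g, alpha, beta, x, n), beta_kernel(g, post_a, post_b))
```

With $\alpha = \beta = 1$ the prior is uniform, and the update simply returns one plus the
number of successes and one plus the number of failures.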


3.3.8 Bayesian Deep Learning


SOL-63 CH.SOL- 3.34.


1. The hidden neuron is distributed according to an
$X \sim \mathrm{binomial}(n, \gamma)$ random variable and fires with a probability of $\gamma$. There are 100
neurons and only 20 are fired.

$$P(x = 20 \mid \gamma) = \binom{100}{20}\gamma^{20}(1 - \gamma)^{80} \qquad (3.76)$$

2. The hidden neuron is distributed according to an
$X \sim \mathrm{uniform}(0, \gamma)$ random variable and fires with a probability of $\gamma$.

The uniform distribution is, of course, a very simple case:

$$f(x; a, b) = \frac{1}{b - a} \quad \text{for } a \le x \le b \qquad (3.77)$$

Therefore:

$$f(x \mid \gamma) = \begin{cases} 0 & \text{if } \gamma < x \text{ or } x < 0 \\ 1/\gamma & \text{if } 0 \le x \le \gamma \end{cases} \qquad (3.78)$$

SOL-64 CH.SOL- 3.35.
The provided distribution is the exponential distribution (with rate 1). Therefore, a single
neuron becomes inactive within 20 seconds with a probability of:

$$p = P(X < 20) = \int_{0}^{20} e^{-x}\,dx = 1 - e^{-20}. \qquad (3.79)$$

The OnOffLayer is off only if at least 150 out of 200 neurons are off. Therefore, this may be
represented as a Binomial distribution and the probability for the layer to be off is:

$$V = \sum_{n \ge 150}^{200} \binom{200}{n} p^{n}(1 - p)^{200-n}. \qquad (3.80)$$

Hence, the probability of the layer being active for at least 20 seconds is 1 minus this value:

$$[1 - V]. \qquad (3.81)$$
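
The two steps of this solution translate directly into code. The sketch below is an
illustration with assumed variable names: it computes the single-neuron probability of
eq. (3.79), sums the binomial tail of eq. (3.80), and subtracts from one as in eq. (3.81).

```python
from math import comb, exp

N_NEURONS = 200
OFF_THRESHOLD = 150   # the layer is off once at least this many neurons are off
T_SECONDS = 20.0

# Probability that one neuron has switched off before T_SECONDS, eq. (3.79).
p_off = 1.0 - exp(-T_SECONDS)

# Probability that at least OFF_THRESHOLD of the neurons are off, eq. (3.80).
layer_off = sum(
    comb(N_NEURONS, k) * p_off**k * (1.0 - p_off) ** (N_NEURONS - k)
    for k in range(OFF_THRESHOLD, N_NEURONS + 1)
)

# Probability that the layer is still active after T_SECONDS, eq. (3.81).
layer_active = 1.0 - layer_off

print(p_off, layer_off, layer_active)
```

Because a single neuron is almost certain to have switched off after 20 seconds, the layer is
almost certainly off as well; in double precision the remaining probability of being active is
too small to resolve and prints as (nearly) zero.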


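SOL-65 CH.SOL- 3.36.
The observed data, i.e. the dropped neurons, are distributed according to:

$$(x_1, \ldots, x_n) \mid \theta \overset{iid}{\sim} \mathrm{Bern}(\theta) \qquad (3.82)$$

Denoting by $s$ and $f$ the number of successes and failures respectively, we know that the likelihood is:

$$p(x_1, \ldots, x_n \mid \theta) = \theta^{s}(1 - \theta)^{f} \qquad (3.83)$$

With the parameters $\alpha = \beta = 1$, the beta distribution acts like a Uniform prior:

$$\theta \sim \mathrm{Beta}(\alpha, \beta), \quad \text{given } \alpha = \beta = 1 \qquad (3.84)$$

Hence, the prior density is:

$$p(\theta) = \frac{1}{B(\alpha, \beta)}\,\theta^{\alpha-1}(1 - \theta)^{\beta-1} \qquad (3.85)$$

Therefore the posterior is:

$$p(\theta \mid x_1, \ldots, x_n) \propto p(x_1, \ldots, x_n \mid \theta)\,p(\theta) \propto \theta^{s}(1 - \theta)^{f}\,\theta^{\alpha-1}(1 - \theta)^{\beta-1} = \theta^{\alpha+s-1}(1 - \theta)^{\beta+f-1} \qquad (3.86)$$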


SOL-66 CH.SOL- 3.37.
Neurons are dropped whenever their value (or the equivalent quantum term, speed) reaches
the most likely value:

$$n(v)\,dv = \frac{4\pi N}{V}\left(\frac{m}{2\pi kT}\right)^{3/2} v^{2}\, e^{-\frac{mv^{2}}{2kT}}\,dv \qquad (3.87)$$

From calculus, we know that in order to maximize a function, we have to equate its first
derivative to zero:

$$\frac{d}{dv}\,n(v) = 0 \qquad (3.88)$$

The constants can be taken out as follows:

$$\frac{d}{dv}\,v^{2} e^{-\frac{mv^{2}}{2kT}} = 0 \qquad (3.89)$$

Applying the product and chain rules from calculus:

$$2v\,e^{-\frac{mv^{2}}{2kT}} + v^{2}\left(-\frac{2mv}{2kT}\right) e^{-\frac{mv^{2}}{2kT}} = 0 \qquad (3.90)$$

We notice that several terms cancel out:

$$v^{2}\,\frac{m}{kT} = 2 \qquad (3.91)$$

Now the quadratic equation can be solved yielding:

$$v_{\text{most\_probable}} = \sqrt{\frac{2kT}{m}} \qquad (3.92)$$

Therefore, this is the most probable value at which the dropout layer will fire.
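
The closed form of eq. (3.92) can be cross-checked by locating the maximum of the speed
distribution on a grid. The following sketch assumes helium atoms at a temperature of 300 K;
both choices, as well as the grid range of 0 to 5000 m/s, are illustrative and not specified by
the problem.

```python
import math

K_BOLTZMANN = 1.380649e-23        # Boltzmann constant, J/K
ATOMIC_MASS_UNIT = 1.66053907e-27  # kg
MASS_HELIUM = 4.0026 * ATOMIC_MASS_UNIT  # assumed particle: a helium atom
TEMPERATURE = 300.0                # assumed temperature, K


def speed_density_shape(v: float, m: float, t: float) -> float:
    """The v-dependent part of eq. (3.87); constant prefactors are omitted."""
    return v**2 * math.exp(-m * v**2 / (2.0 * K_BOLTZMANN * t))


# Closed-form most probable speed, eq. (3.92).
v_closed_form = math.sqrt(2.0 * K_BOLTZMANN * TEMPERATURE / MASS_HELIUM)

# Brute-force check: the grid speed (1 m/s steps) with the highest density.
v_grid_max = max(
    range(0, 5001),
    key=lambda v: speed_density_shape(float(v), MASS_HELIUM, TEMPERATURE),
)

print(f"closed form: {v_closed_form:.1f} m/s, grid maximum: {v_grid_max} m/s")
```

The two values should agree to within the grid spacing, which confirms that dropping the
constant prefactor does not move the location of the maximum.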
Chapter 4 INFORMATION THEORY


A basic idea in information theory is that information can be treated very much 
like a physical quantity, such as mass or energy. 


— Claude Shannon, 1985 
4.1 Introduction


Inductive inference is the problem of reasoning under conditions of incomplete
information, or uncertainty. According to Shannon’s theory [2],
information and uncertainty are two sides of the same coin: the more uncertainty
there is, the more information we gain by removing the uncertainty.
Entropy plays a central role in many scientific realms ranging from physics and statistics
to data science and economics. A basic problem in information theory is encoding
large quantities of information [2].

Shannon’s discovery of the fundamental laws of data compression and transmis- 
sion marked the birth of information theory. In his fundamental paper of 1948, “A 
Mathematical Theory of Communication” [4], Shannon proposed a measure of the uncer- 
tainty associated with a random memory-less source, called Entropy. 


FIGURE 4.1: Mutual information 


Entropy first emerged in thermodynamics in the 19th century, in the pioneering work on steam
engines by Carnot [1] entitled “Reflections on the Motive Power of Fire” (Fig. 4.2).
Subsequently it appeared in statistical mechanics where it was viewed
as a measure of disorder. However, it was Boltzmann (4.30) who found the connection
between entropy and probability, and the notion of information as used by Shannon is
a generalization of the notion of entropy. Shannon’s entropy shares some intuition with
Boltzmann's entropy, and likewise the mathematics developed in information theory
is highly relevant in statistical mechanics.


FIGURE 4.2: Reflections on the motive power of fire (title page of the English edition, “Reflections on the Motive Power of Heat”, by N.-L.-S. Carnot, edited by R. H. Thurston).


The majority of candidates I interview fail to come up with an answer to the following
question: what is the entropy of tossing a non-biased coin? Surprisingly, even after
I explicitly provide them with Shannon’s formula for calculating entropy (4.4), many
are still unable to calculate simple logarithms. The purpose of this chapter is to present
the aspiring data scientist with some of the most significant notions of entropy and
to elucidate its relationship to probability. Therefore, it is primarily focused on basic
quantities in information theory such as entropy, cross-entropy, conditional entropy,
mutual information and Kullback-Leibler divergence, also known as relative entropy.
It does not, however, discuss more advanced topics such as the concept of “active
information” introduced by Bohm and Hiley [3].


4.2 Problems 
4.2.1 Logarithms in Information Theory 


It is important to note that all numerical calculations in this chapter use the binary
logarithm $\log_2$. This specific logarithm produces units of bits, the commonly used units
of information in the field of information theory.


PRB-67 CH.PRB- 4.1.
Run the following Python code (Fig. 4.3) in a Python interpreter. What are the results?


import math
import numpy

print(math.log(1.0/0.98))      # Natural log (ln)
print(numpy.log(1.0/0.02))     # Natural log (ln)

print(math.log10(1.0/0.98))    # Common log (base 10)
print(numpy.log10(1.0/0.02))   # Common log (base 10)

print(math.log2(1.0/0.98))     # Binary log (base 2)
print(numpy.log2(1.0/0.02))    # Binary log (base 2)


FIGURE 4.3: Natural ($\ln$), binary ($\log_2$) and common ($\log_{10}$) logarithms.


PRB-68 CH.PRB- 4.2.
The three basic laws of logarithms:


1. First law

$$\log A + \log B = \log AB. \qquad (4.1)$$

Compute the following expression:

$$\log_{10} 3 + \log_{10} 4.$$


2. Second law

$$\log A^{n} = n\log A. \qquad (4.2)$$

Compute the following expression:

$$\log_{2} 4^{6}.$$

3. Third law

$$\log A - \log B = \log\frac{A}{B}. \qquad (4.3)$$

Therefore, subtracting $\log B$ from $\log A$ results in $\log\frac{A}{B}$.

Compute the following expression:

$$\log_{e} 15 - \log_{e} 3.$$


4.2.2 Shannon's Entropy 


PRB-69 CH.PRB- 4.3.
Write Shannon's famous general formula for uncertainty.


PRB-70 CH.PRB- 4.4.
Choose exactly one, and only one answer.


1. For an event which is certain to happen, what is the entropy?

(a) 1.0
(b) 0.0
(c) The entropy is undefined
(d) $-1$
(e) 0.5
(f) $\log_2(N)$, $N$ being the number of possible events

2. For $N$ equiprobable events, what is the entropy?

(a) 1.0
(b) 0.0
(c) The entropy is undefined
(d) $-1$
(e) 0.5
(f) $\log_2(N)$


PRB-71 @ CH.PRB- 4.5. 
Shannon found that entropy was the only function satisfying three natural properties. 
Enumerate these properties. 


PRB-72 @ CH.PRB- 4.6. 

In information theory, minus the logarithm of the probability of a symbol (essentially 
the number of bits required to represent it efficiently in a binary code) is defined to be the 
information conveyed by transmitting that symbol. In this context, the entropy can be 
interpreted as the expected information conveyed by transmitting a single symbol from an 
alphabet in which the symbols occur with the probabilities $\pi_k$.

Mark the correct answer: Information is a/an [decrease/increase] in uncertainty.


PRB-73 @ CH.PRB- 4.7. 

Claude Shannon's paper "A mathematical theory of communication" [4] marked the
birth of information theory. Published in 1948, it has since become the Magna Carta of the
information age. Describe in your own words what is meant by the term Shannon bit.


PRB-74 @ CH.PRB- 4.8. 
With respect to the notion of surprise in the context of information theory: 


1. Define what is actually meant by being surprised.


2. Describe how it is related to the likelihood of an event happening.


3. True or False: The less likely the occurrence of an event, the less information it
conveys.


PRB-75 @ CH.PRB- 4.9. 

Assume a source of signals that transmits a given message $a$ with probability $P_a$. Assume
further that the message is encoded into an ordered series of ones and zeros (a bit string) and
that a receiver has a decoder that converts the bit string back into its respective message.
Shannon devised a formula that describes the mean length to which the bit string can
be compressed. Write the formula.


PRB-76 @ CH.PRB- 4.10. 
Answer the following questions: 


1. Assume a source that provides a constant stream of N equally likely symbols
$\{u_1, u_2, \ldots, u_N\}$. What does Shannon's formula (4.4) reduce to in this particular
case?


2. Assume that each equiprobable pixel in a monochrome image that is fed to a DL classi- 
fication pipeline, can have values ranging from 0 to 255. Find the entropy in bits. 


PRB-77 CH.PRB- 4.11.
Given Shannon's famous general formula for uncertainty (4.4):


$$H = -\sum_{a=1}^{N} P_a \log_2 P_a \quad \text{(bits per symbol)}. \tag{4.4}$$


1. Plot a graph of the curve of probability vs. uncertainty.


2. Complete the sentence: The curve is [symmetrical/asymmetrical].


3. Complete the sentence: The curve rises to a [minimum/maximum] when the two
symbols are equally likely ($P_a = 0.5$).


PRB-78 @ CH.PRB- 4.12. 
Assume we are provided with a biased coin for which the event ‘heads’ is assigned
probability $p$, and ‘tails’ a probability of $1 - p$. Using (4.4), the respective entropy is:


$$H(p) = -p \log p - (1 - p) \log(1 - p). \tag{4.5}$$


Therefore, $H \geq 0$, and the maximum possible uncertainty, attained when $p = 1/2$, is
$H_{max} = \log_2 2$.

Given the above formulation, describe a helpful property of the entropy that follows from 
the concavity of the logarithmic function. 


PRB-79 @ CH.PRB- 4.13. 
True or False: Given random variables X, Y and Z where Y = X + Z then: 


H(X,Y) = H(X, Z). (4.6) 


PRB-80 @ CH.PRB- 4.14. 
What is the entropy of a biased coin? Suppose a coin is biased such that the probability
of ‘heads’ is $p(x_h) = 0.98$.


1. Complete the sentence: We can predict ‘heads’ for each flip with an accuracy of [___]%.


2. Complete the sentence: If the result of the coin toss is ‘heads’, the amount of Shannon
information gained is [___] bits.


3. Complete the sentence: If the result of the coin toss is ‘tails’, the amount of Shannon
information gained is [___] bits.


4. Complete the sentence: It is always true that the more information is associated with
an outcome, the [more/less] surprising it is.


5. Provided that the ratio of tosses resulting in ‘heads’ is $p(x_h)$, and the ratio of tosses
resulting in ‘tails’ is $p(x_t)$, and also provided that $p(x_h) + p(x_t) = 1$, what is the formula
for the average surprise?


6. What is the value of the average surprise in bits? 


4.2.3 Kullback-Leibler Divergence (KLD) 


PRB-81 O CH.PRB- 4.15. 
Write the formula for the Kullback-Leibler divergence between two discrete probability
distributions P and Q.


PRB-82 @ CH.PRB- 4.16. 
Describe one intuitive interpretation of the KL-divergence with respect to bits. 


PRB-83 @ CH.PRB- 4.17. 
1. True or False: The KL-divergence is not a symmetric measure of similarity, i.e.:


$$D_{KL}(P \| Q) \neq D_{KL}(Q \| P).$$


2. True or False: The KL-divergence satisfies the triangle inequality.

3. True or False: The KL-divergence is not a distance metric.


4. True or False: In information theory, KLD is regarded as a measure of the information
gained when probability distribution Q is used to approximate a true probability
distribution P.


5. True or False: The units of KL-divergence are units of information.


6. True or False: The KLD is always non-negative, namely:


$$D_{KL}(P \| Q) \geq 0.$$


7. True or False: In a decision tree, high information gain indicates that adding a split 
to the decision tree results in a less accurate model. 


PRB-84 CH.PRB- 4.18.
Given two distributions $f_1$ and $f_2$ and their respective joint distribution $f$, write the
formula for the mutual information of $f_1$ and $f_2$.


PRB-85 CH.PRB- 4.19.
The question was commented out but remained here for the consistency of the numbering 
system. 


4.2.4 Classification and Information Gain 


PRB-86 CH.PRB- 4.20.
There are several measures by which one can determine how to optimally split attributes 
in a decision tree. List the three most commonly used measures and write their formulae. 


PRB-87 O CH.PRB- 4.21. 
Complete the sentence: In a decision tree, the attribute by which we choose to split is 
the one with [minimum/maximum] information gain. 


PRB-88 @ CH.PRB- 4.22. 

To study factors affecting the decision of a frog to jump (or not), a deep learning
researcher from a Brazilian rainforest collects data pertaining to several independent binary
covariates.


FIGURE 4.4: A Frog in its natural habitat. Photo taken by my son. 


The binary response variable Jump indicates whether a jump was observed. Referring to
Table (4.1), each row is a labelled instance, the columns Green and Rain denote features,
and the class label (Jump) denotes whether the frog jumped.


Observation | Green | Rain | Jump
x1          | 1     | 0    | +
x2          | 1     | 1    | +
x3          | 1     | 0    | +
x4          | 1     | 1    | +
x5          | 1     | 0    | +
x6          | 0     | 1    | +
x7          | 0     | 0    | -
x8          | 0     | 1    | -


TABLE 4.1: Decision trees and frogs. 


Without explicitly determining the information gain values for each of the attributes,
which attribute should be chosen as the attribute by which the decision tree should be
first partitioned? That is, which attribute has the highest predictive power regarding the
decision of the frog (Fig. 4.4) to jump?
PRB-89 CH.PRB- 4.23.

This question discusses the link between binary classification, information gain and
decision trees. Recent research [5] suggests that Cannabis (Fig. 4.5), and Cannabinoid
administration in particular, may reduce the size of malignant tumours in rodents. The data
(Table 4.2) comprises a training set of feature vectors with corresponding class labels, which
a researcher intends to classify using a decision tree.


FIGURE 4.5: Cannabis 


To study factors affecting tumour shrinkage, the deep learning researcher collects data
regarding two independent binary variables: $\theta_1$ (T/F) indicating whether the rodent is a
female, and $\theta_2$ (T/F) indicating whether the rodent was administered Cannabinoids. The
binary response variable, $y$, indicates whether tumour shrinkage was observed (e.g.
shrinkage=+, no shrinkage=-). Referring to Table (4.2), each row indicates the observed values,
columns ($\theta_i$) denote features and class label ($y$) denotes whether shrinkage was observed.


$y$ | $\theta_1$ | $\theta_2$
+   | T          | T
-   | T          | F
+   | T          | F
+   | T          | T
-   | F          | T


TABLE 4.2: Decision trees and Cannabinoids administration 
1. Describe what is meant by information gain.


2. Describe in your own words how a decision tree works.


3. Using $\log_2$ and the provided dataset, calculate the sample entropy $H(y)$.


4. What is the information gain $IG(\theta_1) = H(y) - H(y|\theta_1)$ for the provided training
corpus?


PRB-90 CH.PRB- 4.24.

To study factors affecting the expansion of stars, a physicist is provided with data
regarding two independent variables: $\theta_1$ (T/F) indicating whether a star is dense, and
$\theta_2$ (T/F) indicating whether a star is adjacent to a black-hole. He is told that the binary
response variable, $y$, indicates whether expansion was observed (e.g. expansion=+,
no expansion=-). Referring to table (4.3), each row indicates the observed values, columns
($\theta_i$) denote features and class label ($y$) denotes whether expansion was observed.


$y$ (expansion) | $\theta_1$ (dense) | $\theta_2$ (black-hole)
+               | F                  | T
+               | T                  | T
+               | T                  | T
-               | F                  | T
+               | T                  | F
-               | F                  | F
-               | F                  | F


TABLE 4.3: Decision trees and star expansion.


1. Using $\log_2$ and the provided dataset, calculate the sample entropy $H(y)$ (expansion)
before splitting.


2. Using $\log_2$ and the provided dataset, calculate the information gain after splitting
on $\theta_1$, i.e. $H(y) - H(y|\theta_1)$.


3. Using $\log_2$ and the provided dataset, calculate the information gain after splitting
on $\theta_2$, i.e. $H(y) - H(y|\theta_2)$.


PRB-91 O CH.PRB- 4.25. 


To study factors affecting tumour shrinkage in humans, a deep learning researcher is
provided with data regarding two independent variables: $\theta_1$ (S/M/L) indicating whether the
tumour is small (S), medium (M) or large (L), and $\theta_2$ (T/F) indicating whether the tumour
has undergone radiation therapy. He is told that the binary response variable, $y$, indicates
whether tumour shrinkage was observed (e.g. shrinkage=+, no shrinkage=-).

Referring to table (4.4), each row indicates the observed values, columns ($\theta_i$) denote
features and class label ($y$) denotes whether shrinkage was observed.


$y$ (shrinkage) | $\theta_1$ | $\theta_2$
[table rows illegible in the source]


TABLE 4.4: Decision trees and radiation therapy.


1. Using $\log_2$ and the provided dataset, calculate the sample entropy $H(y)$ (shrinkage).
2. Using $\log_2$ and the provided dataset, calculate the entropy $H(y|\theta_1)$.
3. Using $\log_2$ and the provided dataset, calculate the entropy $H(y|\theta_2)$.


4. True or false: We should split on a specific variable that minimizes the information
gain, therefore we should split on $\theta_2$ (radiation therapy).


4.2.5 Mutual Information 


PRB-92 CH.PRB- 4.26.
Shannon described a communications system consisting of five elements (4.6), two of which
are the source S and the destination D.


[Diagram labels: Source (S), message, signal, signal, Receiver (R), message, Destination (D)]


FIGURE 4.6: Shannon's five element communications system.


1. Draw a Venn diagram depicting the relationship between the entropies of the source
H(S) and of the destination H(D).


2. Annotate the part termed equivocation.

3. Annotate the part termed noise.


4. Annotate the part termed mutual information.


5. Write the formula for mutual information.


PRB-93 @ CH.PRB- 4.27. 
Complete the sentence: The relative entropy $D(p \| q)$ is the measure of (a) [___] between
two distributions. It can also be expressed as a measure of the (b) [___] of assuming that the
distribution is $q$ when the (c) [___] distribution is $p$.


PRB-94 CH.PRB- 4.28.

Complete the sentence: Mutual information is a Shannon entropy-based measure of
dependence between random variables. The mutual information between X and Z can be
understood as the (a) [___] of the (b) [___] in X given Z:


$$I(X; Z) := H(X) - H(X \mid Z), \tag{4.7}$$


where H is the Shannon entropy, and $H(X \mid Z)$ is the conditional entropy of X given Z.


4.2.6 Mechanical Statistics 


Some books have a tendency of sweeping "unseen" problems under the rug. We will
not do that here. This subsection may look intimidating, and for a good reason: it
involves equations that, unless you are a physicist, you have probably never
encountered before. Nevertheless, the ability to cope with new concepts lies at the heart
of every job interview.

For some of the questions, you may need these constants:


PHYSICAL CONSTANTS


k | Boltzmann's constant     | $1.381 \times 10^{-23}$ J K$^{-1}$
c | Speed of light in vacuum | $2.998 \times 10^{8}$ m s$^{-1}$
h | Planck's constant        | $6.626 \times 10^{-34}$ J s


PRB-95 CH.PRB- 4.29.
What is the expression for the Boltzmann probability distribution? 


PRB-96 @ CH.PRB- 4.30. 

Information theory, quantum physics and thermodynamics are closely interconnected.
There are several equivalent formulations for the second law of thermodynamics. One
approach to describing uncertainty stems from Boltzmann's fundamental work on entropy in
statistical mechanics. Describe what is meant by Boltzmann's entropy.


PRB-97 CH.PRB- 4.31.
From Boltzmann's perspective, what is the entropy of an octahedral die (4.7)?


FIGURE 4.7: An octahedral die.


4.2.7 Jensen's inequality 


PRB-98 @ CH.PRB- 4.32. 


1. Define the term concave function. 
2. Define the term convex function. 


3. State Jensen's inequality and its implications. 


PRB-99 @ CH.PRB- 4.33. 
True or False: Using Jensen's inequality, it is possible to show that the KL divergence
is always greater than or equal to zero.


4.3 Solutions 
4.3.1 Logarithms in Information Theory 


SOL-67 CH.SOL- 4.1.
Numerical results (4.8) are provided using Python interpreter version 3.6.


import math
import numpy

print(math.log(1.0/0.98))     # Natural log (ln)
> 0.0202027073...
print(numpy.log(1.0/0.02))    # Natural log (ln)
> 3.912023005428146
print(math.log10(1.0/0.98))   # Common log (base 10)
> 0.0087739243...
print(numpy.log10(1.0/0.02))  # Common log (base 10)
> 1.6989700043360187
print(math.log2(1.0/0.98))    # Binary log (base 2)
> 0.0291463456...
print(numpy.log2(1.0/0.02))   # Binary log (base 2)
> 5.643856189774724


FIGURE 4.8: Logarithms in information theory.


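SOL-68 CH.SOL- 4.2.
The logarithm base is explicitly written in each solution.


1.

$$\log_{10} 3 + \log_{10} 4 = \log_{10}(3 \times 4) = \log_{10} 12.$$

2.

$$\log_2 4^6 = 6 \log_2 4.$$

3.

$$\log_2 15 - \log_2 3 = \log_2 \frac{15}{3} = \log_2 5.$$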
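The three laws can also be checked numerically. The following is a minimal sketch using only the standard `math` module; the variable names `first`, `second` and `third` are illustrative, and the bases mirror the expressions as reconstructed above.

```python
import math

# First law: log A + log B = log(AB)
first = math.isclose(math.log10(3) + math.log10(4), math.log10(3 * 4))

# Second law: log A^n = n log A
second = math.isclose(math.log2(4 ** 6), 6 * math.log2(4))

# Third law: log A - log B = log(A / B)
third = math.isclose(math.log2(15) - math.log2(3), math.log2(15 / 3))

print(first, second, third)
```

Because the laws hold for any base, swapping `math.log2` for `math.log` or `math.log10` should leave all three checks unchanged.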
4.3.2 Shannon's Entropy


SOL-69 CH.SOL- 4.3.
Shannon's famous general formula for uncertainty is:


$$H = -\sum_{a=1}^{N} P_a \log_2 P_a \quad \text{(bits per symbol)}. \tag{4.8}$$
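As a hedged illustration, the formula can be written as a short Python function. The name `shannon_entropy` is an assumption made for this sketch, and terms with zero probability are skipped, following the usual convention that $p \log p \to 0$ as $p \to 0$.

```python
import math

def shannon_entropy(probs):
    """Entropy in bits of a discrete distribution given as a list of probabilities."""
    if not math.isclose(sum(probs), 1.0):
        raise ValueError("probabilities must sum to 1")
    # -P log2 P is written as P log2 (1/P); zero-probability terms are skipped.
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

print(shannon_entropy([0.5, 0.5]))    # two equally likely symbols
print(shannon_entropy([1.0]))         # a certain event
print(shannon_entropy([0.25] * 4))    # four equally likely symbols
```

The three calls correspond to a fair coin, a certain event, and four equiprobable symbols, which are the cases discussed in the solutions that follow.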


SOL-70 CH.SOL- 4.4.


1. No information is conveyed by an event which is a-priori known to occur for certain
($P_a = 1$), therefore the entropy is 0.


2. Equiprobable events mean that $P_i = 1/N \ \forall i \in [1, N]$. Therefore, for N equally-likely
events, the entropy is $\log_2(N)$.


SOL-71 Y CH.SOL- 4.5. 
The three properties are as follows: 


1. H(X) is always non-negative, since information cannot be lost. 
2. The uniform distribution maximizes H(X), since it also maximizes uncertainty. 


3. The additivity property which relates the sum of entropies of two independent events. 
For instance, in thermodynamics, the total entropy of two isolated systems which co- 
exist in equilibrium is the sum of the entropies of each system in isolation. 
SOL-72 CH.SOL- 4.6.
Information is an [increase] in uncertainty.


SOL-73 Y CH.SOL- 4.7. 

The Shannon bit has two distinct states; it is either 0 or 1, but never both at the same
time. Shannon devised an experiment in which there is a question whose only two possible
answers are equally likely.

He then defined one bit as the amount of information gained (or alternatively, the amount
of entropy removed) once an answer to the question has been learned. He then continued to
state that when the a-priori probability of one possible answer is higher than that of the
other, the answer conveys less than one bit of information.


SOL-74 @ CH.SOL- 4.8. 

The notion of surprise is directly related to the likelihood of an event happening.
Mathematically, it is inversely related to the probability of that event: the surprise of an
event with probability $p$ is $\log_2(1/p)$.
Accordingly, learning that a high-probability event has taken place, for instance the sun rising,
is much less of a surprise and gives less information than learning that a low-probability
event, for instance rain on a hot summer day, has taken place. Therefore, the less likely the
occurrence of an event, the more information it conveys.
In the case where an event is a-priori known to occur for certain ($P_a = 1$), no
information is conveyed by it. On the other hand, an extremely rare event conveys a lot of
information as it surprises us and informs us that a very improbable state exists. Therefore,
the statement in part 3 is false.
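The relation between probability and surprise can be made concrete with a few lines of Python. This is a sketch only; `surprisal_bits` is a hypothetical helper name, and the three probabilities are arbitrary examples of a certain, a fair-coin, and a rare event.

```python
import math

def surprisal_bits(p):
    """Shannon information (surprise) of an event with probability p, in bits."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    return math.log2(1.0 / p)

for p in (1.0, 0.5, 0.02):  # certain, fair-coin, and rare events
    print(p, surprisal_bits(p))
```

The printed values grow as `p` shrinks, which is the sense in which rarer events convey more information.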


SOL-75 UY CH.SOL- 4.9. 
The quantity $I_{Sh}$, given by the formula below, is called the Shannon information of the
source:


$$I_{Sh} = -\sum_{a} P_a \log_2 P_a. \tag{4.9}$$


It refers to the mean length in bits, per message, into which the messages can be
compressed. It is then possible for a communications channel to transmit $I_{Sh}$ bits per
message with a capacity of $I_{Sh}$.


SOL-76 Ud CH.SOL- 4.10. 


1. For N equiprobable events it holds that $P_i = 1/N, \ \forall i \in [1, N]$. Therefore, if we
substitute this into Shannon's equation we get:


$$H_{equiprobable} = -\sum_{i=1}^{N} \frac{1}{N} \log_2 \frac{1}{N}. \tag{4.10}$$


Since N does not depend on i, we can pull it out of the sum:


$$H_{equiprobable} = -\left(\frac{1}{N} \log_2 \frac{1}{N}\right) \sum_{i=1}^{N} 1 \tag{4.11}$$

$$= -\left(\frac{1}{N} \log_2 \frac{1}{N}\right) N$$

$$= -\log_2 \frac{1}{N} \tag{4.12}$$

$$= \log_2 N.$$


It can be shown that for a given number of symbols (i.e., N is fixed) the uncertainty H 
has its largest value only when the symbols are equally probable. 


2. The probability for each pixel to be assigned a value in the given range is: 


pi = 1/256. (4.13) 


Therefore the entropy is: 


H = —(256)(1/256)(—8) = 8 [bits/symbol]. (4.14) 
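The reduction to $\log_2 N$ can be verified by summing the N identical terms explicitly. The sketch below assumes nothing beyond the standard library; `uniform_entropy` is an illustrative name.

```python
import math

def uniform_entropy(n):
    """Entropy in bits of n equally likely symbols, summed term by term."""
    p = 1.0 / n
    return -sum(p * math.log2(p) for _ in range(n))

for n in (2, 8, 256):
    # The two printed numbers should agree: the explicit sum and log2(n).
    print(n, uniform_entropy(n), math.log2(n))
```

The case `n = 256` corresponds to the equiprobable 8-bit pixel of part 2.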


SOL-77 CH.SOL- 4.11.

Refer to Fig. 4.9 for the corresponding illustration of the graph, where information is
shown as a function of p. It is equal to 0 for p = 0 and for p = 1. This is reasonable because for
such values of p the outcome is certain, so no information is gained by learning the outcome.
At maximal uncertainty, p = 0.5, the entropy equals 1 bit. Thus, the curve is symmetrical
and the information gain is maximal when the probabilities of the two possible events are
equal. Furthermore, for the entire range of probabilities between p = 0.4 and p = 0.6 the
information is close to 1 bit.


[Plot: entropy (vertical axis, 0 to 1) against probability (horizontal axis, 0 to 1)]


FIGURE 4.9: H vs. Probability
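The shape of the curve can be reproduced without a plotting library by tabulating the binary entropy on a coarse grid. This is a minimal sketch; the grid spacing of 0.1 and the names `grid`, `curve` and `peak` are choices made here for illustration.

```python
import math

def binary_entropy(p):
    """H(p) in bits for a two-symbol source; H(0) = H(1) = 0 by convention."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

grid = [i / 10 for i in range(11)]          # p = 0.0, 0.1, ..., 1.0
curve = [binary_entropy(p) for p in grid]

for p, h in zip(grid, curve):
    print(f"p = {p:.1f}  H = {h:.4f}")

peak = grid[curve.index(max(curve))]        # location of the maximum on the grid
print("maximum at p =", peak)
```

Reading the table from both ends shows the symmetry about p = 0.5, where the maximum of 1 bit is reached.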


SOL-78 UY CH.SOL- 4.12. 

An important set of properties of the entropy follows from the concavity of the entropy,
which follows from the concavity of the logarithm. Suppose that in an experiment, we cannot
decide whether the actual probability of ‘heads’ is $p_1$ or $p_2$. We may decide to assign
probability $q$ to the first alternative and probability $1 - q$ to the second. The actual
probability of ‘heads’ then is the mixture $q p_1 + (1 - q) p_2$. The corresponding entropies
satisfy the inequality:


$$S(q p_1 + (1 - q) p_2) \geq q S(p_1) + (1 - q) S(p_2), \tag{4.15}$$

where $S$ denotes the entropy. The two sides are equal in the extreme cases where
$p_1 = p_2$, or $q = 0$, or $q = 1$.
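A quick numerical check of the concavity inequality (4.15) may help. The sketch below uses the binary entropy for $S$; `concavity_gap` is a hypothetical helper that returns the left-hand side minus the right-hand side, and the probabilities are arbitrary.

```python
import math

def binary_entropy(p):
    """H(p) in bits; H(0) = H(1) = 0 by convention."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def concavity_gap(p1, p2, q):
    """S(q*p1 + (1-q)*p2) - [q*S(p1) + (1-q)*S(p2)]; non-negative by concavity."""
    mix = q * p1 + (1 - q) * p2
    return binary_entropy(mix) - (q * binary_entropy(p1) + (1 - q) * binary_entropy(p2))

print(concavity_gap(0.1, 0.9, 0.5))   # strict inequality: the mixture is more uncertain
print(concavity_gap(0.3, 0.3, 0.5))   # extreme case p1 == p2
print(concavity_gap(0.1, 0.9, 1.0))   # extreme case q == 1
```

The first gap is clearly positive, while the two extreme cases give a gap of zero up to floating-point rounding.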


SOL-79 Y CH.SOL- 4.13. 

Given (X, Y), we can determine X and Z = Y - X. Conversely, given (X, Z), we can
determine X and Y = X + Z. Hence, H(X, Y) = H(X, Z) due to the existence of this
bijection.
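The bijection argument can be illustrated on a small example. The sketch assumes, purely for illustration, that X and Z are independent and uniform on {0, 1, 2}; the argument itself does not need independence. The names `p_xz`, `p_xy` and `joint_entropy` are hypothetical.

```python
import math
from collections import Counter
from itertools import product

def joint_entropy(dist):
    """Entropy in bits of a joint distribution given as {(a, b): probability}."""
    return sum(p * math.log2(1.0 / p) for p in dist.values() if p > 0)

# Hypothetical example: X and Z independent, each uniform on {0, 1, 2}.
p_xz = {(x, z): 1.0 / 9 for x, z in product(range(3), repeat=2)}

# Push the same probability mass through the map (x, z) -> (x, x + z).
p_xy = Counter()
for (x, z), p in p_xz.items():
    p_xy[(x, x + z)] += p

print(joint_entropy(p_xz), joint_entropy(p_xy))
```

Since the map is one-to-one, no two pairs merge, the probabilities are merely relabelled, and the two entropies coincide.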


SOL-80 UY CH.SOL- 4.14. 
The solution and numerical calculations are provided using $\log_2$.


1. We can predict ‘heads’ for each flip with an accuracy of $p(x_h) = 98\%$.


2. According to Fig. (4.10), if the result of the coin toss is ‘heads’, the amount of Shannon
information gained is $\log_2(1/0.98)$ [bits].


import math
import numpy

print(math.log2(1.0/0.98))    # Binary log (base 2)
> 0.0291463456...
print(numpy.log2(1.0/0.02))   # Binary log (base 2)
> 5.643856189774724


FIGURE 4.10: Shannon information gain for a biased coin toss.


3. Likewise, if the result of the coin toss is ‘tails’, the amount of Shannon information
gained is $\log_2(1/0.02)$ [bits].


4. It is always true that the more information is associated with an outcome, the more
surprising it is.


5. The formula for the average surprise is:


$$H(x) = p(x_h) \log \frac{1}{p(x_h)} + p(x_t) \log \frac{1}{p(x_t)}. \tag{4.16}$$


6. The value of the average surprise in bits is (4.11):


$$H(x) = [0.98 \times 0.0291] + [0.02 \times 5.643] \approx 0.1414 \ \text{[bits]}. \tag{4.17}$$


import autograd.numpy as np

def binaryEntropy(p):
    return -p*np.log2(p) - (1-p)*np.log2(1-p)

print("binaryEntropy(p) is:{:.4f} bits".format(binaryEntropy(0.98)))
> binaryEntropy(p) is:0.1414 bits


FIGURE 4.11: Average surprise


4.3.3 Kullback-Leibler Divergence 


SOL-81 Y CH.SOL- 4.15. 
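For discrete probability distributions P and Q, the Kullback-Leibler divergence (KLD) from P
to Q is defined as:


$$D(P \| Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)} \tag{4.18}$$

$$= \sum_{x} P(x) \log \frac{1}{Q(x)} - \sum_{x} P(x) \log \frac{1}{P(x)} = \underbrace{H_P(Q)}_{\text{Cross Entropy}} - \underbrace{H(P)}_{\text{Entropy}}.$$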
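The decomposition of the KLD into cross entropy minus entropy can be checked numerically. The following sketch uses base-2 logarithms, so the result is in bits; the distributions `P` and `Q` are hypothetical, and the function names are illustrative rather than taken from any library.

```python
import math

def entropy(p):
    """H(P) in bits."""
    return sum(pi * math.log2(1.0 / pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """H_P(Q): average bits when events from P are coded with a code built for Q."""
    return sum(pi * math.log2(1.0 / qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    """D(P||Q) in bits; assumes q_i > 0 wherever p_i > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

P = [0.5, 0.5]   # hypothetical "true" distribution
Q = [0.9, 0.1]   # hypothetical approximating distribution

# The two numbers on each line illustrate (i) the decomposition, (ii) the asymmetry.
print(kl_divergence(P, Q), cross_entropy(P, Q) - entropy(P))
print(kl_divergence(P, Q), kl_divergence(Q, P))
```

The first line shows that both routes give the same value; the second shows that swapping the arguments changes the result.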
SOL-82 CH.SOL- 4.16.

One interpretation is the following: the KL-divergence indicates the average number of
additional bits required for transmission of values $x \in X$ which are distributed according
to $P(x)$, but which we erroneously encoded according to distribution $Q(x)$. This makes sense
since you have to “pay” for additional bits to compensate for not knowing the true distribution,
thus using a code that was optimized according to another distribution. This is one of the
reasons that the KL-divergence is also known as relative entropy. Formally, the cross entropy
has an information interpretation: it is the average number of bits used when values sent
according to $P$ are encoded with the code for $Q$, so its excess over $H(P)$ quantifies how
many bits are wasted by using the wrong code:


$$H_P(Q) = \sum_{x} \underbrace{P(x)}_{\text{Sending } P} \ \underbrace{\log \frac{1}{Q(x)}}_{\text{code for } Q}. \tag{4.19}$$


SOL-83 CH.SOL- 4.17.


1. True. KLD is a non-symmetric measure, i.e. $D(P \| Q) \neq D(Q \| P)$.
2. False. KLD does not satisfy the triangle inequality.

3. True. KLD is not a distance metric.


4. True. KLD is regarded as a measure of information gain. Notice, however, that when
Q is used to approximate P, KLD is the amount of information lost.


5. True. The units of KL divergence are units of information (bits, nats, etc.).
6. True. KLD is a non-negative measure.


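7. False. High information gain means that the split reduces the entropy substantially,
so the resulting partitions are purer and the tree fits the training data more accurately.
Splitting greedily on highly informative attributes can still hurt generalization through
overfitting, but that is a separate issue.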


SOL-84 CH.SOL- 4.18.
Formally, mutual information attempts to measure how correlated two variables are with
each other:


$$I(X;Y) = \sum_{x \in X} \sum_{y \in Y} P(x,y) \log \frac{P(x,y)}{P(x)P(y)} \tag{4.20}$$

$$= H(X) + H(Y) - H(X,Y).$$


Regarding the question at hand, given two distributions $f_1$ and $f_2$ and their joint
distribution $f$, the mutual information of $f_1$ and $f_2$ is defined as $I(f_1, f_2) = H(f, f_1 f_2)$,
i.e. the relative entropy between $f$ and the product $f_1 f_2$. If the two distributions are
independent, i.e. $f = f_1 \cdot f_2$, the mutual information will vanish. This
concept has been widely used as a similarity measure in image analysis.
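Both forms of the mutual information in (4.20) can be computed from a joint probability table. The sketch below is illustrative: the two 2x2 tables are hypothetical, and `mutual_information` and `mi_from_entropies` are names introduced here, not a library API.

```python
import math

def entropy(probs):
    """Entropy in bits of a list of probabilities."""
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

def mutual_information(joint):
    """I(X;Y) in bits from a joint probability table joint[x][y]."""
    px = [sum(row) for row in joint]            # marginal of X
    py = [sum(col) for col in zip(*joint)]      # marginal of Y
    return sum(
        pxy * math.log2(pxy / (px[i] * py[j]))
        for i, row in enumerate(joint)
        for j, pxy in enumerate(row)
        if pxy > 0
    )

def mi_from_entropies(joint):
    """The same quantity computed as H(X) + H(Y) - H(X,Y)."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    flat = [p for row in joint for p in row]
    return entropy(px) + entropy(py) - entropy(flat)

independent = [[0.25, 0.25], [0.25, 0.25]]   # joint equals product of marginals
dependent = [[0.5, 0.0], [0.0, 0.5]]         # Y is a copy of X

print(mutual_information(independent), mi_from_entropies(independent))
print(mutual_information(dependent), mi_from_entropies(dependent))
```

For the independent table the mutual information vanishes, as stated above; for the fully dependent table it equals the entropy of either variable.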


SOL-85 Y CH.SOL- 4.19. 
The question was commented out but remained here for the consistency of the numbering 
system. m 


4.3.4 Classification and Information Gain 


SOL-86 CH.SOL- 4.20.
The three most widely used methods are:


1. Entropy:

$$\text{Entropy}(t) = -\sum_{i} p(i) \log_2 p(i). \tag{4.21}$$

2. Gini impurity:

$$\text{Gini}(t) = 1 - \sum_{i} [p(i)]^2. \tag{4.22}$$

3. Classification error:

$$\text{Classification error}(t) = 1 - \max_i [p(i)]. \tag{4.23}$$
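For comparison, the three impurity measures can be evaluated side by side on a few class distributions. This is a minimal sketch; the function names and the example distributions are assumptions made for illustration.

```python
import math

def entropy(p):
    """Entropy(t) = -sum_i p(i) log2 p(i)."""
    return sum(pi * math.log2(1.0 / pi) for pi in p if pi > 0)

def gini(p):
    """Gini(t) = 1 - sum_i p(i)^2."""
    return 1.0 - sum(pi ** 2 for pi in p)

def classification_error(p):
    """Classification error(t) = 1 - max_i p(i)."""
    return 1.0 - max(p)

for dist in ([1.0, 0.0], [0.9, 0.1], [0.5, 0.5]):
    print(dist, entropy(dist), gini(dist), classification_error(dist))
```

All three measures are zero for a pure node and largest for the evenly mixed node, which is why each can serve as a splitting criterion.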


SOL-87 CH.SOL- 4.21.
In a decision tree, the attribute by which we choose to split is the one with [maximum]
information gain.


SOL-88 CH.SOL- 4.22.
It is clear that the entropy will be decreased more by first splitting on Green rather than
on Rain.


[Tree: root node Green, with leaves Jump and No Jump]


FIGURE 4.12: First split.


SOL-89 Ud CH.SOL- 4.23. 


1. Information gain is the expected reduction in entropy caused by partitioning values in 
a dataset according to a given attribute. 


2. A decision tree learning algorithm chooses the next attribute to partition the currently 
selected node, by first computing the information gain from the entropy, for instance, 
as a splitting criterion. 


3. There are 3 positive examples corresponding to Shrinkage=+, and 2 negative examples
corresponding to Shrinkage=-. Using the formula:


$$H(Y) = -\sum_{i=1}^{k} P(Y = y_i) \log_2 P(Y = y_i) \tag{4.24}$$


and the probabilities:


$$P(Y = +) = \frac{3}{5}, \tag{4.25}$$

$$P(Y = -) = \frac{2}{5}, \tag{4.26}$$


the overall entropy before splitting is (4.13):


$$E_{orig} = -(3/5) \log_2 (3/5) - (2/5) \log_2 (2/5) = H(y) \approx 0.97095 \ \text{[bits/symbol]}. \tag{4.27}$$


import autograd.numpy as np

def binaryEntropy(p):
    return -p*np.log2(p) - (1-p)*np.log2(1-p)

print("binaryEntropy(p) is:{:.5f} bits".format(binaryEntropy(3/5)))
> binaryEntropy(p) is:0.97095 bits


FIGURE 4.13: Entropy before splitting.


4. If we split on $\theta_1$, the relative shrinkage frequency is (4.5):


    | Total | $\theta_1 = T$ | $\theta_1 = F$
+   | 3     | 3              | 0
-   | 2     | 1              | 1


TABLE 4.5: Splitting on $\theta_1$.


To compute the information gain (IG) based on feature $\theta_1$, we must first compute the
entropy of $y$ after a split based on $\theta_1$, $H(y|\theta_1)$:


$$H(y|\theta_1) = -\sum_{j=1}^{v} \left[ P(\theta_1 = \theta_j) \sum_{i=1}^{k} P(Y = y_i \mid \theta_1 = \theta_j) \log_2 P(Y = y_i \mid \theta_1 = \theta_j) \right].$$


Therefore, using the data for the relative shrinkage frequency (4.5), the entropies of the
two branches after splitting on $\theta_1$ are (with the convention $0 \log_2 0 = 0$):


$$E_{\theta_1 = T} = -\frac{3}{4} \log_2 \frac{3}{4} - \frac{1}{4} \log_2 \frac{1}{4} \approx 0.8113, \qquad E_{\theta_1 = F} = -\frac{0}{1} \log_2 \frac{0}{1} - \frac{1}{1} \log_2 \frac{1}{1} = 0.0. \tag{4.28}$$


Now we know that $P(\theta_1 = T) = 4/5$ and $P(\theta_1 = F) = 1/5$, therefore:


$$\Delta = E_{orig} - (4/5) E_{\theta_1 = T} - (1/5) E_{\theta_1 = F} = 0.97095 - (4/5) \times 0.8113 - (1/5) \times 0.0 \approx 0.3219 \ \text{[bits/symbol]}. \tag{4.29}$$
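The same calculation can be carried out directly on the rows of Table 4.2. The following is a sketch rather than a reference implementation: the dictionary keys `y`, `theta1` and `theta2` and the helper names are assumptions introduced here.

```python
import math
from collections import Counter

def label_entropy(labels):
    """Sample entropy in bits of a list of class labels."""
    n = len(labels)
    return sum((c / n) * math.log2(n / c) for c in Counter(labels).values())

def information_gain(rows, feature, label="y"):
    """H(y) - H(y | feature) for a list of dict rows."""
    labels = [r[label] for r in rows]
    conditional = 0.0
    for value in {r[feature] for r in rows}:
        subset = [r[label] for r in rows if r[feature] == value]
        # Weight each branch entropy by the fraction of rows in that branch.
        conditional += len(subset) / len(rows) * label_entropy(subset)
    return label_entropy(labels) - conditional

# Table 4.2: y = shrinkage, theta1 = female, theta2 = Cannabinoids administered
rows = [
    {"y": "+", "theta1": True,  "theta2": True},
    {"y": "-", "theta1": True,  "theta2": False},
    {"y": "+", "theta1": True,  "theta2": False},
    {"y": "+", "theta1": True,  "theta2": True},
    {"y": "-", "theta1": False, "theta2": True},
]

print(label_entropy([r["y"] for r in rows]))   # entropy before splitting
print(information_gain(rows, "theta1"))        # gain from splitting on theta1
```

Calling `information_gain(rows, "theta2")` in the same way gives the gain for the second feature, so the two candidate splits can be compared.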


SOL-90 CH.SOL- 4.24.
There are 4 positive examples corresponding to Expansion=+, and 3 negative examples
corresponding to Expansion=-.


1. The overall entropy before splitting is (4.14):


$$E_{orig} = -(4/7) \log_2 (4/7) - (3/7) \log_2 (3/7) \approx 0.9852281 \ \text{[bits/symbol]}. \tag{4.30}$$


import autograd.numpy as np

def binaryEntropy(p):
    return -p*np.log2(p) - (1-p)*np.log2(1-p)

print("binaryEntropy(p) is:{:.7f} bits".format(binaryEntropy(4/7)))
> binaryEntropy(p) is:0.9852281 bits


FIGURE 4.14: Entropy before splitting.


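2. If we split on $\theta_1$, the relative star expansion frequency is (4.6):


    | Total | $\theta_1 = T$ | $\theta_1 = F$
+   | 4     | 3              | 1
-   | 3     | 0              | 3


TABLE 4.6: Splitting on $\theta_1$.


Therefore, the entropies of the two branches after splitting on $\theta_1$ are (with the
convention $0 \log_2 0 = 0$):


$$E_{\theta_1 = T} = -\frac{3}{3} \log_2 \frac{3}{3} - \frac{0}{3} \log_2 \frac{0}{3} = 0.0, \qquad E_{\theta_1 = F} = -\frac{1}{4} \log_2 \frac{1}{4} - \frac{3}{4} \log_2 \frac{3}{4} \approx 0.81128. \tag{4.31}$$


Now we know that $P(\theta_1 = T) = 3/7$ and $P(\theta_1 = F) = 4/7$, therefore:


$$\Delta = E_{orig} - (3/7) E_{\theta_1 = T} - (4/7) E_{\theta_1 = F} = 0.98523 - (3/7) \times 0.0 - (4/7) \times 0.81128 \approx 0.5216 \ \text{[bits/symbol]}. \tag{4.32}$$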


3. If we split on $\theta_2$, the relative star expansion frequency is (4.7):


    | Total | $\theta_2 = T$ | $\theta_2 = F$
+   | 4     | 3              | 1
-   | 3     | 1              | 2


TABLE 4.7: Splitting on $\theta_2$.


The entropies of the two branches after splitting on $\theta_2$ are:


$$E_{\theta_2 = T} = -\frac{3}{4} \log_2 \frac{3}{4} - \frac{1}{4} \log_2 \frac{1}{4} \approx 0.8113, \qquad E_{\theta_2 = F} = -\frac{1}{3} \log_2 \frac{1}{3} - \frac{2}{3} \log_2 \frac{2}{3} \approx 0.9183. \tag{4.33}$$


Now we know that $P(\theta_2 = T) = 4/7$ and $P(\theta_2 = F) = 3/7$, therefore:


$$\Delta = E_{orig} - (4/7) E_{\theta_2 = T} - (3/7) E_{\theta_2 = F} = 0.98523 - (4/7) \times 0.8113 - (3/7) \times 0.9183 \approx 0.1281 \ \text{[bits/symbol]}. \tag{4.34}$$
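SOL-91 CH.SOL- 4.25.


1.

$$H(y) = -\left(\frac{2}{6} \log_2 \frac{2}{6} + \frac{4}{6} \log_2 \frac{4}{6}\right) = -\left(\frac{1}{3} \log_2 \frac{1}{3} + \frac{2}{3} \log_2 \frac{2}{3}\right) \approx 0.92 \ \text{[bits/symbol]}. \tag{4.35}$$

2.

$$H(y|\theta_1) = -\frac{2}{6}\left(\frac{1}{2} \log_2 \frac{1}{2} + \frac{1}{2} \log_2 \frac{1}{2}\right) - \frac{2}{6}\left(\frac{1}{2} \log_2 \frac{1}{2} + \frac{1}{2} \log_2 \frac{1}{2}\right) - \frac{2}{6}(1 \log_2 1)$$

$$= \frac{2}{6}(1) + \frac{2}{6}(1) + \frac{2}{6}(0) = \frac{2}{3} \approx 0.67 \ \text{[bits/symbol]}. \tag{4.36}$$

3.

$$H(y|\theta_2) = -\frac{1}{2}\left(\frac{1}{3} \log_2 \frac{1}{3} + \frac{2}{3} \log_2 \frac{2}{3}\right) - \frac{1}{2}(1 \log_2 1) = \frac{1}{2}\left(\log_2 3 - \frac{2}{3}\right)$$

$$= \frac{1}{2} \log_2 3 - \frac{1}{3} \approx 0.46 \ \text{[bits/symbol]}. \tag{4.37}$$

4. False. We should split on the variable that maximizes the information gain, i.e. the one
with the lowest conditional entropy.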


4.3.5 Mutual Information 


SOL-92 Ud CH.SOL- 4.26. 


1. The diagram is depicted in Fig. 4.15. 
FIGURE 4.15: Mutual Information between H(S) & H(D).


2. Equivocation is annotated by E. 
3. Noise is annotated by N. 


4. The intersection (shaded area) in (4.15) corresponds to mutual information of the source 
H(S) and of the destination H(D). 


5. The formula for mutual information is:


H(S;D) = H(S) - E = H(D) - N. (4.38) 


SOL-93 @ CH.SOL- 4.27. 

The relative entropy $D(p \| q)$ is the measure of difference between two distributions. It
can also be expressed as a measure of the inefficiency of assuming that the distribution is $q$
when the true distribution is $p$.


SOL-94 & CH.SOL- 4.28. 
Mutual information is a Shannon entropy-based measure of dependence between random
variables. The mutual information between X and Z can be understood as the reduction of
the uncertainty in X given Z:

$$I(X; Z) := H(X) - H(X \mid Z), \tag{4.39}$$


where H is the Shannon entropy, and $H(X \mid Z)$ is the conditional entropy of X given Z.


4.3.6 Mechanical Statistics 


SOL-95 GJ CH.SOL- 4.29. 
Is this question valuable? a 
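The source leaves this solution open. For reference, the standard textbook form of the Boltzmann distribution assigns to a state with energy $E_i$ the probability $p_i = e^{-E_i / kT} / Z$, where $Z = \sum_j e^{-E_j / kT}$ is the partition function. The sketch below evaluates it for a hypothetical three-level system; the energy levels, the temperature and the function name are assumptions made for illustration, and `K_BOLTZMANN` uses the value from the constants table.

```python
import math

K_BOLTZMANN = 1.381e-23  # J/K, value from the physical constants table

def boltzmann_probabilities(energies_joule, temperature_kelvin):
    """p_i = exp(-E_i / kT) / Z for a list of state energies."""
    kt = K_BOLTZMANN * temperature_kelvin
    # Subtract the minimum energy for numerical stability; it cancels in the ratio.
    e_min = min(energies_joule)
    weights = [math.exp(-(e - e_min) / kt) for e in energies_joule]
    z = sum(weights)  # partition function (up to the cancelled constant factor)
    return [w / z for w in weights]

# Hypothetical three-level system with energies 0, kT and 2kT at 300 K.
kT = K_BOLTZMANN * 300.0
probs = boltzmann_probabilities([0.0, kT, 2 * kT], 300.0)
print(probs)
```

Lower-energy states receive higher probability, and each step of one kT in energy reduces the probability by a factor of e.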


SOL-96 & CH.SOL- 4.30. 

Boltzmann related the degree of disorder of the state of a physical system to the logarithm
of its probability. If, for example, the system has $n$ non-interacting and identical particles,
each capable of existing in each of $K$ equally likely states, the leading term in the logarithm of
the probability of finding the system in a configuration with $n_1$ particles in state 1, $n_2$ in state
2, etc., is given by the Boltzmann entropy $H_B = -\sum_{i=1}^{K} \pi_i \log(\pi_i)$, where $\pi_i = n_i / n$.


SOL-97 @ CH.SOL- 4.31. 
There are 8 equiprobable events in each roll of the die, therefore:

$$H = -\sum_{i=1}^{8} \frac{1}{8} \log_2 \frac{1}{8} = 3 \ \text{[bits]}. \tag{4.40}$$


4.3.7 Jensen's inequality 


SOL-98 Ø CH.SOL- 4.32. 
1. A function $f$ is concave in the range $[a, b]$ if $f''$ is negative in the range $[a, b]$.


2. A function $f$ is convex in the range $[a, b]$ if $f''$ is positive in the range $[a, b]$.


3. The following inequality was published by J.L. Jensen in 1906:


(Jensen's Inequality) Let $f$ be a function convex up (i.e. convex) on $(a, b)$. Then for any
$n \geq 2$ numbers $x_i \in (a, b)$:


$$f\left(\frac{\sum_{i=1}^{n} x_i}{n}\right) \leq \frac{\sum_{i=1}^{n} f(x_i)}{n},$$

and the equality is attained if and only if $f$ is linear or all $x_i$ are equal.
For a convex down (i.e. concave) function, the sign of the inequality changes to $\geq$.


Jensen's inequality states that if $f$ is convex in the range $[a, b]$, then:


$$\frac{f(a) + f(b)}{2} \geq f\left(\frac{a + b}{2}\right).$$


Equality holds if and only if $a = b$. Jensen's inequality states that if $f$ is concave in the
range $[a, b]$, then:


$$\frac{f(a) + f(b)}{2} \leq f\left(\frac{a + b}{2}\right).$$


Equality holds if and only if $a = b$.


SOL-99 UY CH.SOL- 4.33. 
True The non-negativity of KLD can be proved using Jensen's inequality. m 
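A sketch of the argument, written for discrete distributions and using the concavity of the logarithm (so that Jensen's inequality reads $E[\log Z] \leq \log E[Z]$), with all sums taken over the values $x$ for which $P(x) > 0$:

```latex
\begin{aligned}
-D(P \,\|\, Q)
  &= \sum_{x} P(x) \log \frac{Q(x)}{P(x)} \\
  &\leq \log \sum_{x} P(x) \, \frac{Q(x)}{P(x)}
     && \text{(Jensen, $\log$ is concave)} \\
  &= \log \sum_{x} Q(x) \\
  &\leq \log 1 = 0 .
\end{aligned}
```

Hence $D(P \,\|\, Q) \geq 0$, with equality when $P = Q$.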
CHAPTER


5 DEEP LEARNING: CALCULUS, ALGORITHMIC DIFFERENTIATION 


The true logic of this world is in the calculus of probabilities. 


— James C. Maxwell 
5.1 Introduction


CALCULUS is the mathematics of change; the differentiation of a function is
key to almost every domain in the scientific and engineering realms, and
calculus is also very much central to DL. A standard curriculum of first year
calculus includes topics such as limits, differentiation, the derivative, Taylor
series, integration, and the integral. Many aspiring data scientists who lack a relevant
mathematical background and are shifting careers hope to easily enter the field but
frequently encounter a mental barricade.
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f(z) | F) 


sin(x) | cos(x) 


( 
cos(x) | — sin(x) 
( 


x) 


x 


log 
e 


Thanks to the rapid advances in processing power and the proliferation of GPUs, 
it is possible to lend the burden of computation to a computer with high efficiency 
and precision. For instance, extremely fast implementations of backpropagation, the 
gradient descent algorithm, and automatic differentiation (AD) [5] brought artificial in- 
telligence from a mere concept to reality. 

Calculus is frequently taught in a way that is very burdensome to the student;
therefore, I tried incorporating the writing of Python code snippets and the usage of
DAGs (Directed Acyclic Graphs) into the learning process. Gradient descent is the
essence of optimization in deep learning, which requires efficient access to first and
second order derivatives that AD frameworks provide. While older AD frameworks were
written in C++ ([4]), the newer ones are Python-based, such as Autograd ([10]) and
JAX ([3], [1]).

Derivatives are also crucial in graphics applications. For example, in a render- 
ing technique entitled global illumination, photons bounce in a synthetically generated 
scene while their direction and colour has to be determined using derivatives based 
on the specific material each photon hits. In ray tracing algorithms, the colour of the 
pixels is determined by tracing the trajectory the photons travel from the eye of the 
observer through a synthetic 3D scene. 

A function is usually represented by a DAG. For instance, one commonly used 
form is to represent intermediate values as nodes and operations as arcs (5.2). One 
other commonly used form is to represent not only the values but also the operations 
as nodes (5.11). 

The first representation of a function by a DAG goes back to [7]. 
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FIGURE 5.1: Intermediate value theorem 


Manual differentiation is tedious, error-prone, and practically unusable for real-time
graphics applications wherein numerous successive derivatives have to be repeatedly
calculated. Symbolic differentiation, on the other hand, is a computer-based
method that uses a collection of differentiation rules to analytically calculate an exact
derivative of a function, resulting in a purely symbolic derivative. Many symbolic
differentiation libraries utilize what is known as operator-overloading ([9]) for both the
forward and reverse forms of differentiation, albeit they are not quite as fast as AD.


5.2 Problems 
5.2.1 AD, Gradient descent & Backpropagation 


AD [5] is the application of the chain rule to functions by computers in order to
automatically compute derivatives. AD plays a significant role in training deep learning
algorithms, and in order to understand AD you need a solid grounding in calculus. As
opposed to numerical differentiation, AD is a procedure for establishing exact
derivatives without any truncation errors. AD breaks a computer program into a series of
fundamental mathematical operations, and the gradient or Hessian of the computer
program is found by successive application of the chain rule (5.1) to its elementary
constituents.
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For instance, in the C++ programming language, two techniques ([4]) are com- 
monly utilized in transforming a program that calculates numerical values of a func- 
tion into a program which calculates numerical values for derivatives of that function; 
(1) an operator overloading approach and (2) systematic source code transformation. 


$$\frac{\partial}{\partial t} f(g(t)) = \frac{\partial f}{\partial g} \cdot \frac{\partial g}{\partial t} \quad (5.1)$$


One notable feature of AD is that the values of the derivatives produced by applying
AD, as opposed to numerical differentiation (finite difference formulas), are exact
up to floating point round-off. Two variants of AD are widely adopted by the scientific
community: the forward mode and the reverse mode, where the underlying distinction
between them is the order in which the chain rule is applied. The forward mode, also
entitled tangent mode, propagates derivatives from the independent towards the dependent
variables, whereas the reverse or adjoint mode does exactly the opposite. AD makes
heavy use of a concept known as dual numbers (DN), first introduced by Clifford ([2]).


FIGURE 5.2: A computation graph with intermediate values as nodes and operations as
arcs.


5.2.2 Numerical differentiation 


PRB-100 € CH.PRB- 5.1. 


1. Write the formula for the finite difference rule used in numerical differentiation.


2. What is the main problem with this formula?
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3. Indicate one problem with software tools which utilize numerical differentiation and 
successive operations on floating point numbers. 


PRB-101 @ CH.PRB- 5.2. 


1. Given a function f(x) and a point a, define the instantaneous rate of change of
f(x) at a.


2. What other commonly used alternative name does the instantaneous rate of change 
have? 


3. Given a function f(x) and a point a, define the tangent line of f(x) at a. 


5.2.3 Directed Acyclic Graphs 


There are two possible ways to traverse a DAG (Directed Acyclic Graph). One
method is simple: start at the bottom and go through all nodes to the top of the
computational tree. That is nothing else than passing through the corresponding
computation sequence top down. Based on this method, the so-called forward mode of AD
was developed [8]. In contrast to this forward mode, the reverse mode was first used by
Speelpenning [13], who passed through the underlying graph top down and propagated the
gradient backwards.
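To make the reverse traversal concrete, here is a minimal sketch of reverse-mode AD in plain Python. It is not the implementation of any particular framework: the class `Var`, the helper `log`, and the function `backward` are hypothetical names chosen for illustration, and the test function x1 * x2 + ln(x1) is borrowed from Code 5.1 later in this chapter.

```python
import math

class Var:
    """A graph node; `parents` holds (parent node, local partial derivative) pairs."""
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents
        self.grad = 0.0

    def __add__(self, other):
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))

def log(v):
    return Var(math.log(v.value), ((v, 1.0 / v.value),))

def backward(output):
    """Propagate adjoints from the output node back to the inputs."""
    order, seen = [], set()

    def visit(node):                      # depth-first post-order traversal
        if id(node) not in seen:
            seen.add(id(node))
            for parent, _ in node.parents:
                visit(parent)
            order.append(node)

    visit(output)
    output.grad = 1.0                     # seed: d(output)/d(output) = 1
    for node in reversed(order):          # consumers are handled before producers
        for parent, local in node.parents:
            parent.grad += node.grad * local

x1, x2 = Var(2.0), Var(5.0)
y = x1 * x2 + log(x1)
backward(y)
print(y.value, x1.grad, x2.grad)          # dy/dx1 = x2 + 1/x1, dy/dx2 = x1
```

A single backward sweep yields the partial derivatives with respect to all inputs at once, which is why the reverse mode suits functions with many inputs and few outputs.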


PRB-102 @ CH.PRB- 5.3. 


1. State the definition of the derivative f'(c) of a function f(x) at x = c.


2. With respect to the DAG depicted in 5.3: 
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FIGURE 5.3: An expression graph for g(x). Constants are shown in gray, crossed-out since 
derivatives should not be propagated to constant operands. 


(a) Traverse the graph 5.3 and find the function g(x) it represents. 
(b) Using the definition of the derivative, find g'(9). 


PRB-103 @ CH.PRB- 5.4. 


1. With respect to the expression graph depicted in 5.4, traverse the graph and find the 
function g(x) it represents. 


FIGURE 5.4: An expression graph for g(x). Constants are shown in gray, crossed-out since
derivatives should not be propagated to constant operands.


2. Using the definition of the derivative find the derivative of g(x). 


5.2.4 The chain rule 


PRB-104 CH.PRB- 5.5.

1. The chain rule is a key concept in differentiation. Define it.


2. Elaborate how the chain rule is utilized in the context of neural networks. 


5.2.5 Taylor series expansion 


The idea behind a Taylor series is that if you know a function and all its derivatives
at one point x = a, you can approximate the function at other points near a. As an
example, take $f(x) = \sqrt{x}$. You can use a Taylor series to approximate $\sqrt{10}$ by
knowing f(9) and the derivatives f'(9), f''(9), and so on.

The Maclaurin series (5.2) is the special case of the Taylor series centered at a = 0,
that is, when f(0), f'(0), f''(0), ... are known:

$$f(x) = f(0) + x f'(0) + \frac{x^2}{2!} f''(0) + \frac{x^3}{3!} f'''(0) + \cdots = \sum_{p=0}^{\infty} \frac{x^p}{p!} f^{(p)}(0) \quad (5.2)$$

For instance, the Maclaurin expansion of cos(x) uses the derivatives:

$$f(x) = \cos x, \quad f'(x) = -\sin x, \quad f''(x) = -\cos x, \quad f'''(x) = \sin x, \quad (5.3)$$

which, when evaluated at 0, result in:

$$\cos x = 1 - \frac{x^2}{2!} + \frac{x^4}{4!} - \frac{x^6}{6!} + \cdots \quad (5.4)$$
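The two approximations mentioned in this subsection can be tried out numerically. The following sketch is illustrative only; the function names `maclaurin_cos` and `taylor_sqrt_about_9` and the number of terms are assumptions, not part of the original text.

```python
import math

def maclaurin_cos(x, terms):
    """Partial sum of cos(x) = sum_{n>=0} (-1)^n x^(2n) / (2n)!."""
    return sum((-1) ** n * x ** (2 * n) / math.factorial(2 * n)
               for n in range(terms))

def taylor_sqrt_about_9(x):
    """Second-order Taylor polynomial of sqrt(x) centered at a = 9."""
    a = 9.0
    f0 = math.sqrt(a)         # f(9)   = 3
    f1 = 0.5 * a ** -0.5      # f'(9)  = 1/6
    f2 = -0.25 * a ** -1.5    # f''(9) = -1/108
    d = x - a
    return f0 + f1 * d + f2 * d ** 2 / 2

print(maclaurin_cos(0.5, 6), math.cos(0.5))
print(taylor_sqrt_about_9(10.0), math.sqrt(10.0))
```

With only three terms the approximation of the square root of 10 is already accurate to about three decimal places, because the evaluation point is close to the expansion point.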


PRB-105 @ CH.PRB- 5.6. 
Find the Taylor series expansion for: 


1.
$\frac{1}{1 - x}$ (5.5)

2.
$e^x$ (5.6)

3.
sin(x) (5.7)

4.
cos(x) (5.8)


PRB-106 @ CH.PRB- 5.7. 
Find the Taylor series expansion for: 


log(x) (5.9) 


PRB-107 @ CH.PRB- 5.8. 
Find the Taylor series expansion centered at x = —3 for: 


$f(x) = 5x^2 - 11x + 1$ (5.10)


PRB-108 @ CH.PRB- 5.9. 
Find the 101st degree Taylor polynomial centered at x = 0 for:


f(x) = cos(x) (5.11) 


PRB-109 @ CH.PRB- 5.10. 
At x = 1, compute the first 7 terms of the Taylor series expansion of: 


$f(x) = \ln 3x$ (5.12)
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5.2.6 Limits and continuity 


Theorem 1 (L'Hopital's rule).

$$\lim_{x \to a} \frac{f(x)}{g(x)} = \lim_{x \to a} \frac{f'(x)}{g'(x)} \quad (5.13)$$


PRB-110 O CH.PRB- 5.11. 
Find the following limits: 


ge 27 
Liu = = 
a3 3x—9 


2. $\lim_{x \to 0} \frac{e^{x^2} - x - 1}{3\cos x - x - 3}$


3. $\lim_{x \to \infty} \frac{x - \ln x}{\sqrt[100]{x} + 4}$


5.2.7 Partial derivatives 


PRB-111 O CH.PRB- 5.12. 


1. True or false: When applying a partial derivative, there are two variables considered 
constants - the dependent and independent variable. 


2. Given g(x, y), find its partial derivative with respect to x: 


$g(x, y) = x^2 y + yx + 8y.$ (5.14)


PRB-112 CH.PRB- 5.13.


The gradient of a two-dimensional function is given by

$$\nabla f(x, y) = \frac{\partial f}{\partial x}\,\mathbf{i} + \frac{\partial f}{\partial y}\,\mathbf{j} \quad (5.15)$$

1. Find the gradient of the function:
$$f(x, y) = xy^2 - y^2 + x^3 \quad (5.16)$$
2. Given the function:
$$g(x, y) = x^2 y + xy^2 - y - 1, \quad (5.17)$$

evaluate its directional derivative at (−1, 0), directed at (1, 1).


PRB-113 @ CH.PRB- 5.14. 
Find the partial derivatives of: 


$f(x, y) = 3\sin^2(x - y)$ (5.18)


PRB-114 CH.PRB- 5.15.
Find the partial derivatives of:


$z = 2\sin(x)\sin(y)$ (5.19)


5.2.8 Optimization 


PRB-115 @ CH.PRB- 5.16. 


Consider $f(x) = \frac{x^2 + 1}{(x + 2)^2}$.

1. Where is f(x) well defined?
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2. Where is f(x) increasing and decreasing? 


3. Where does f(x) attain its minimum and maximum values?


PRB-116 @ CH.PRB- 5.17. 
Consider $f(x) = 2x^3 - x$.


1. Derive f(x) and conclude on its behavior. 


2. Derive once again and discuss the concavity of the function f(x). 


PRB-117 O CH.PRB- 5.18. 


Consider the function 


$f(x, y) = 2x^2 - xy + y^2,$


and find maximum, minimum, and saddle points. 


5.2.9 The Gradient descent algorithm 


PRB-118 @ CH.PRB- 5.19. 

The gradient descent algorithm can be utilized for the minimization of convex functions.
Minimizing a convex function requires finding its stationary points. A very simple
approach for finding stationary points is to start at an arbitrary point, move against the
direction of the gradient at that point towards the next point, and repeat until converging
to a stationary point.


1. What is the term used to describe the vector of all partial derivatives for a function 
f(x)? 


2. Complete the sentence: when searching for a minimum, if the derivative is positive, the
function is increasing/decreasing.

3. The function x² as depicted in 5.5 has a derivative of f'(x) = 2x. Evaluated at x =
−1, the derivative equals f'(x = −1) = −2. At x = −1, the function is decreasing
as x gets larger. What will happen if we wish to find a minimum using gradient descent,
and increase (decrease) x by the size of the gradient, and then again repeatedly keep
jumping?


4. How can this phenomenon be alleviated?


5. True or False: The gradient descent algorithm is guaranteed to find a local minimum 
if the learning rate is correctly decreased and a finite local minimum exists. 


FIGURE 5.5: The x² function.


PRB-119 @ CH.PRB- 5.20. 


1. Is the data linearly separable?


X1 | X2 | Y
1 | 1 | +
12 | 12 | −
4 | 5 | −
12 | 12 | +
(5.20)

2. What is the loss function for linear regression?


3. What is the gradient descent algorithm to minimize a function f (x)? 


5.2.10 The Backpropagation algorithm 


The most important, expensive and hard to implement part of any hardware realiz- 
ation of ANNs is the non-linear activation function of a neuron. Commonly applied 
activation functions are the sigmoid and the hyperbolic tangent. In the most used 
learning algorithm in present day applications, back-propagation, the derivatives of 
the sigmoid function are needed when back propagating the errors. 

The backpropagation algorithm looks for the minimum of the error function in 
weight space using the method of gradient descent. 
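The derivative of the sigmoid needed during back-propagation has the closed form σ'(x) = σ(x)(1 − σ(x)). The sketch below compares that closed form with a central finite difference; the function names and the step size h are assumptions made for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    """Analytical derivative: sigma'(x) = sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

def central_difference(f, x, h=1e-5):
    """Numerical derivative, used here only as a sanity check."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

for x in (-2.0, 0.0, 1.0):
    print(x, sigmoid_grad(x), central_difference(sigmoid, x))
```

The analytical form reuses the value already computed in the forward pass, which is one reason the sigmoid is cheap to differentiate.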


PRB-120 @ CH.PRB- 5.21. 


1. During the training of an ANN, a sigmoid layer applies the sigmoid function to every
element in the forward pass, while in the backward pass the chain rule is utilized
as part of the backpropagation algorithm. With respect to the backpropagation
algorithm, given a sigmoid activation function $\sigma(x) = \frac{1}{1 + e^{-x}}$ and J as
the cost function, annotate each part of equation (5.21):

$$dZ = \frac{dJ}{d\sigma(x)} \cdot \frac{d\sigma(x)}{dx} = dA \cdot \sigma(x) \cdot (1 - \sigma(x)) \quad (5.21)$$


2. Code snippet 5.6 provides a pure Python-based (i.e. not using Autograd) implementation
of the forward pass for the sigmoid function. Complete the backward pass that
directly computes the analytical gradients.

import numpy as np

class Sigmoid:
    def forward(self, x):
        # Cache the input; it is needed by the backward pass.
        self.x = x
        return 1 / (1 + np.exp(-x))

    def backward(self, grad):
        grad_input = [???]  # to be completed
        return grad_input


FIGURE 5.6: Forward pass for the sigmoid function.


PRB-121 @ CH.PRB- 5.22. 

This question deals with the effect of customized transfer functions. Consider a neural 
network with hidden units that use z? and output units that use sin(2x) as transfer func- 
tions. Using the chain rule, starting from OE /Oyx, derive the formulas for the weight updates 
Awjx and Aw;;. Notice - do not include partial derivatives in your final answer. 


5.2.11 Feed forward neural networks 


Understanding the inner workings of Feed Forward Neural Networks (FFNN) is
crucial to the understanding of other, more advanced Neural Networks such as CNNs.


A Neural Network (NN) is an interconnected assembly of simple processing 
elements, units or nodes, whose functionality is loosely based on the animal 
neuron. The processing ability of the network is stored in the inter-unit 
connection strengths, or weights, obtained by a process of adaptation to, or 
learning from, a set of training patterns. [6] 


The Backpropagation Algorithm is the most widely used learning algorithm for
FFNN. Backpropagation is a training method that uses the Generalized Delta Rule. Its
basic idea is to perform a gradient descent on the total squared error of the network
output, considered as a function of the weights. It was first described by Werbos and
made popular by the paper of Rumelhart, Hinton and Williams [12].
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5.2.12 Activation functions, Autograd/JAX 


Activation functions, and most commonly the sigmoid activation function, are 
heavily used for the construction of NNs. We utilize Autograd ([10]) and the recently 
published JAX ([1]) library to learn about the relationship between activation func- 
tions and the Backpropagation algorithm. 

Using a logistic, or sigmoid, activation function has some benefits in being able
to easily take derivatives and then interpret them using a logistic regression model.
Autograd is a core module in PyTorch ([11]) and adds inherent support for automatic
differentiation for all operations on tensors and functions. Moreover, one can implement
a custom Autograd function by subclassing the autograd Function and
implementing the forward and backward passes, which operate on PyTorch tensors.
PyTorch provides a simple syntax (5.7) which is transparent to both CPU/GPU support.


import torch
from torch.autograd import Function

class DLFunction(Function):
    @staticmethod
    def forward(ctx, input):
        ...

    @staticmethod
    def backward(ctx, grad_output):
        ...


FIGURE 5.7: PyTorch syntax for autograd.


PRB-122 @ CH.PRB- 5.23. 


1. True or false: In Autograd, if any input tensor of an operation has requires_grad=True, 
the computation will be tracked. After computing the backward pass, a gradient w.r.t. 
this tensor is accumulated into .grad attribute 
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2. True or false: In Autograd, multiple calls to backward will sum up previously com- 
puted gradients if they are not zeroed. 


PRB-123 @ CH.PRB- 5.24. 

Your friend, a veteran of the DL community, wants to use logistic regression and
implement custom activation functions using Autograd. Logistic regression is used when the
variable y that we want to predict can only take on discrete values (i.e. classification).
Considering a binary classification problem (y = 0 or y = 1) (5.8), the hypothesis function
could be defined so that it is bounded between [0, 1], in which case we use some form of
logistic function, such as the sigmoid function. Other, more efficient functions exist, such
as the ReLU (Rectified Linear Unit), which we discuss later. Note: The weights in (5.8) are
only meant for illustration purposes and are not part of the solution.


[Diagram: inputs, weights, summation, activation]


FIGURE 5.8: A typical binary classification problem.


1. Given the sigmoid function $g(x) = \frac{1}{1 + e^{-x}}$, what is the expression for the
corresponding hypothesis in logistic regression?


2. What is the decision boundary?


3. What does $h_\theta(x) = 0.8$ mean?


4. Using an Autograd based Python program, implement both the forward and backward
pass for the sigmoid activation function and evaluate its derivative at x = 1.


5. Using an Autograd based Python program, implement both the forward and backward
pass for the ReLU activation function and evaluate its derivative at x = 1.


PRB-124 € CH.PRB- 5.25. 
For real values −1 < x < 1, the inverse hyperbolic tangent function is defined as:

$$\tanh^{-1} x = \frac{1}{2}\left[\ln(1 + x) - \ln(1 - x)\right] \quad (5.22)$$

The artanh function, which returns the inverse hyperbolic tangent of
its argument x, is implemented in numpy as arctanh().
Its derivative is given by:

$$(\operatorname{arctanh}(x))' = \frac{1}{1 - x^2} \quad (5.23)$$

Your friend, a veteran of the DL community, wants to implement a custom activation
function for the arctanh function using Autograd. Help him realize the method.
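Before turning to Autograd, definitions (5.22) and (5.23) can be sanity-checked with the standard library alone. This is a minimal sketch: the helper names are hypothetical, `math.atanh` is used as the reference implementation, and the finite-difference step h is an arbitrary choice.

```python
import math

def artanh_log_form(x):
    """Inverse hyperbolic tangent via (5.22); valid for -1 < x < 1."""
    return 0.5 * (math.log(1.0 + x) - math.log(1.0 - x))

def artanh_grad(x):
    """Derivative from (5.23): 1 / (1 - x^2)."""
    return 1.0 / (1.0 - x * x)

h = 1e-6
for x in (0.37, 0.192, 0.571):
    numeric = (math.atanh(x + h) - math.atanh(x - h)) / (2.0 * h)
    print(x, artanh_log_form(x), math.atanh(x), artanh_grad(x), numeric)
```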


1. Use this numpy array as an input, [[0.37, 0.192, 0.571]], and evaluate the result using
pure Python.


2. Use the PyTorch based torch.autograd.Function class to implement a custom Function
that implements the forward pass for the arctanh function in Python.


3. Use the PyTorch based torch.autograd.Function class to implement a custom Function
that implements the backward pass for the arctanh function in Python.


4. Name the class ArtanhFunction, and using the gradcheck method from torch.autograd,
verify that your analytical gradients match the numerical gradients calculated by gradcheck.
Remember that the function is invoked through the .apply(x) method, which
torch.autograd.Function provides.


5.2.13 Dual numbers in AD 
Dual numbers (DN) are analogous to complex numbers and augment real numbers
with a dual element by adjoining an infinitesimal element d, for which d² = 0.


PRB-125 @ CH.PRB- 5.26. 
1. Explain how AD uses floating point numerical values rather than symbolic expressions.
2. Explain the notion of DN as introduced by ([2]).


3. What arithmetic operations are possible on DN?


4. Explain the relationship between a Taylor series and DN.
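For readers who prefer to experiment before answering, the arithmetic of dual numbers can be sketched in a few lines of Python. This is an illustrative sketch, not the book's solution: the class `Dual`, the helper `sin`, and the test function g(x) = 3x² + sin(x) are all assumptions introduced here.

```python
import math

class Dual:
    """A dual number a + b*d with d*d = 0: `real` is a, `dual` is b."""
    def __init__(self, real, dual=0.0):
        self.real = real
        self.dual = dual

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.real + other.real, self.dual + other.dual)
    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # (a + b d)(c + e d) = ac + (ae + bc) d, because d^2 = 0
        return Dual(self.real * other.real,
                    self.real * other.dual + self.dual * other.real)
    __rmul__ = __mul__

def sin(z):
    # sin(a + b d) = sin(a) + cos(a) * b * d
    return Dual(math.sin(z.real), math.cos(z.real) * z.dual)

def g(x):
    return 3 * x * x + sin(x)   # hypothetical test function

z = g(Dual(2.0, 1.0))           # dual part seeded with 1 to obtain dg/dx
print(z.real, z.dual)           # g(2) and g'(2) = 6 * 2 + cos(2)
```

Evaluating the function once on a dual number returns both the value and the derivative, which is the mechanism behind forward-mode AD.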


PRB-126 O CH.PRB- 5.27. 
1. Expand the following function using DN: 


$\sin(x + \dot{x}d)$ (5.24)


2. With respect to the expression graph depicted in 5.9: 


FIGURE 5.9: An expression graph for g(x). Constants are shown in gray, crossed-out since
derivatives should not be propagated to constant operands.


(a) Traverse the graph 5.9 and find the function g(x) it represents. 
(b) Expand the function g(x) using DN. 


3. Show that the general identity

$$g(x + \dot{x}d) = g(x) + g'(x)\dot{x}d \quad (5.25)$$

holds in this particular case too.


4. Using the derived DN, evaluate the function g(x) at x = 2.


5. Using an Autograd based Python program, implement the function and evaluate its
derivative at x = 2.


PRB-127 @ CH.PRB- 5.28. 
With respect to the expression graph depicted in 5.10: 


FIGURE 5.10: An expression graph for g(x). Constants are shown in gray, crossed-out 
since derivatives should not be propagated to constant operands. 


1. Traverse the graph 5.10 and find the function g(x) it represents.


2. Expand the function g(x) using DN.


3. Using the derived DN, evaluate the function g(x) at x = 5.


4. Using an Autograd based Python program, implement the function and evaluate its
derivative at x = 5.


5.2.14 Forward mode AD 


PRB-128 CH.PRB- 5.29.
When differentiating a function using forward-mode AD, the value of such an
expression can be computed from its corresponding directed acyclic graph by propagating
the numerical values.


1. Find the function, g(A, B, C) represented by the expression graph in 5.11. 


FIGURE 5.11: A computation graph for g(x)


2. Find the partial derivatives for the function g(x). 


PRB-129 @ CH.PRB- 5.30. 
Answer the following given that a computational graph of a function has N inputs and 
M outputs. 


1. True or False?: 


(a) Forward and reverse mode AD always yield the same result. 


(b) In reverse mode AD there are fewer operations (time) and less space for interme- 
diates (memory). 


(c) The cost for forward mode grows with N. 


(d) The cost for reverse mode grows with M. 


PRB-130 CH.PRB- 5.31.
1. Transform the source code in code snippet 5.1 into a function g(x1, x2).


CODE 5.1: A function, g(x1, x2), in the C programming language.


#include <math.h>

float g(float x1, float x2) {
    float v1, v2, v3, v4, v5;
    v1 = x1;
    v2 = x2;
    v3 = v1 * v2;
    v4 = logf(v1);  /* natural logarithm, ln(v1) */
    v5 = v3 + v4;
    return v5;
}


2. Transform the function g(x1, x2) into an expression graph.


3. Find the partial derivatives for the function g(x1, x2).


5.2.15 Forward mode AD table construction 


PRB-131 O CH.PRB- 5.32. 


1. Given the function:
$$f(x_1, x_2) = x_1 x_2 + \ln(x_1) \quad (5.26)$$

and the graph 5.1, annotate each vertex (edge) of the graph with the partial derivatives
that would be propagated in forward mode AD.


2. Transform the graph into a table that computes the function
g(x1, x2) evaluated at (x1; x2) = (e²; π) using forward-mode AD.


3. Write and run a Python code snippet to prove your results are correct. 


4. Describe the role of seed values in forward-mode AD.

5. Transform the graph into a table that computes the derivative of g(x1, x2) evaluated
at (x1; x2) = (e²; π) using forward-mode AD for x1 as the chosen independent
variable.


6. Write and run a Python code snippet to prove your results are correct. 


5.2.16 Symbolic differentiation 


In this section, we introduce the basic functionality of the SymPy (SYMbolic Python) 
library commonly used for symbolic mathematics as a means to deepen your under- 
standing in both Python and calculus. If you are using Sympy in a Jupyter notebook 
in Google Colab (e.g. https://colab.research.google.com/) then rendering 
sympy equations requires MathJax to be available within each cell output. The follow- 
ing is a hook function that will make this possible: 


CODE 5.2: Sympy in Google Colab


from IPython.display import display, Math, HTML

def enable_sympy_in_cell():
    # Load MathJax in each cell output so that SymPy expressions render.
    display(HTML("<script "
                 "src='https://cdnjs.cloudflare.com/ajax/libs/"
                 "mathjax/2.7.3/latest.js?config=default'>"
                 "</script>"))

get_ipython().events.register('pre_run_cell', enable_sympy_in_cell)


After successfully registering this hook, SymPy rendering (5.3) will work correctly: 


CODE 5.3: Rendering Sympy in Google Colab


import sympy
from sympy import *

init_printing()
x, y, z = symbols('x y z')
Integral(sqrt(1/x), (x, 0, oo))


It is also recommended to use the latest version of Sympy:


CODE 5.4: Updating Sympy


> pip install --upgrade sympy


5.2.17 Simple differentiation 


PRB-132 O CH.PRB- 5.33. 
Answer the following questions: 


1. Which differentiation method is inherently prone to rounding errors? 


2. Define the term symbolic differentiation. 


PRB-133 @ CH.PRB- 5.34. 
Answer the following questions: 


1. Implement the sigmoid function $\sigma(x) = \frac{1}{1 + e^{-x}}$ symbolically using a Python
based SymPy program.


2. Differentiate the sigmoid function using SymPy and compare it with the analytical
derivation $\sigma'(x) = \sigma(x)(1 - \sigma(x))$.


3. Using SymPy, evaluate the gradient of the sigmoid function at x = 0.


4. Using SymPy, plot the resulting gradient of the sigmoid function.


5.2.18 The Beta-Binomial model 


PRB-134 CH.PRB- 5.35.
You will most likely not be given such a long programming task during a face-to-face
interview. Nevertheless, an extensive home programming assignment is typically given at
many of the start-ups I am familiar with. You should allocate approximately four to
six hours to completely answer all questions in this problem.

We discussed the Beta-Binomial model extensively in chapter 3. Recall that the Beta- 
Binomial distribution is frequently used in Bayesian statistics to model the number of suc- 
cesses in n trials. We now employ SymPy to do the same; demonstrate computationally how 
a prior distribution is updated to develop into a posterior distribution after observing the 
data via the relationship of the Beta-Binomial distribution. 

Provided the probability of success, the number of successes after n trials follows a bino- 
mial distribution. Note that the beta distribution is a conjugate prior for the parameter of 
the binomial distribution. In this case, the likelihood function is binomial, and a beta prior 
distribution yields a beta posterior distribution. 

Recall that for the Beta-Binomial distribution the following relationships exist:


Prior of θ | Beta(a, b)
Likelihood | binomial(n, θ)
Posterior of θ | Beta(a + x, b + n − x)
Posterior Mean | (a + x)/(a + b + n)

(5.27)
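The conjugate update in (5.27) amounts to simple bookkeeping on the Beta parameters, where x denotes the number of successes in n trials. The sketch below is only a numerical illustration of that relationship, not the SymPy solution requested in this problem; the function names are hypothetical and the values a = 2, b = 7, n = 100, x = 50 are taken from the questions that follow.

```python
def beta_binomial_update(a, b, n, x):
    """Posterior Beta parameters after observing x successes in n trials."""
    return a + x, b + n - x

def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)

a_post, b_post = beta_binomial_update(a=2, b=7, n=100, x=50)
print(a_post, b_post, beta_mean(a_post, b_post))
```

The posterior mean computed this way equals (a + x)/(a + b + n), since the two posterior parameters sum to a + b + n.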


1. Likelihood: The starting point for our inference problem is the Likelihood, the
probability of the observed data. Find the Likelihood function symbolically using SymPy.
Convert the SymPy representation to a purely Numpy based callable function with a
Lambda expression. Evaluate the Likelihood function at θ = 0.5 with 50 successful
trials out of 100.


2. Prior: The Beta Distribution. Define the Beta distribution, which will act as our prior
distribution, symbolically using SymPy. Convert the SymPy representation to a purely
Numpy based callable function. Evaluate the Beta Distribution at θ: 0.5, a: 2, b: 7.


3. Plot the Beta distribution, using the Numpy based function.


4. Posterior: Find the posterior distribution by multiplying our Beta prior by the
Binomial Likelihood symbolically using SymPy. Convert the SymPy representation to
a purely Numpy based callable function. Evaluate the Posterior Distribution at θ:
0.5, a: 2, b: 7.


5. Plot the posterior distribution, using the Numpy based function.


6. Show that the posterior distribution has the same functional dependence on θ as the
prior, and it is just another Beta distribution.


7. Given: 
Prior : Beta(0|a = 2,b = 7) = 560 (—@ + 1) and: 
Likelihood : Bin(r = 3|n = 6,0) = 1960063 (—6 + 1)“ find the resulting posterior 
distribution and plot it. 


5.3 Solutions 
5.3.1 Algorithmic differentiation, Gradient descent 


5.3.2 Numerical differentiation 


SOL-100 Y CH.SOL- 5.1. 
1. The formula is:

$$f'(x) \approx \frac{f(x + h) - f(x)}{h} \quad (5.28)$$

2. The main problem with this formula is that it suffers from numerical instability for
small values of h.


3. In some numerical software systems, the number $\sqrt{2}$ may be represented as the
floating point number ≈ 1.414213562. Therefore, the result of
float($\sqrt{2}$) × float($\sqrt{2}$) may equal ≈ 2.000000446.
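Both effects are easy to reproduce. In the sketch below (an illustration with arbitrarily chosen step sizes and sin as the test function), the error of the forward difference first shrinks as h decreases and then grows again once round-off dominates.

```python
import math

def forward_difference(f, x, h):
    """Forward finite difference approximation of f'(x), as in (5.28)."""
    return (f(x + h) - f(x)) / h

exact = math.cos(1.0)   # exact derivative of sin at x = 1
for h in (1e-2, 1e-6, 1e-10, 1e-15):
    approx = forward_difference(math.sin, 1.0, h)
    print(h, approx, abs(approx - exact))

# Round-off in a single product of floating point numbers:
print(math.sqrt(2.0) * math.sqrt(2.0) == 2.0)   # False in IEEE-754 double precision
```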


SOL-101 CH.SOL- 5.2.
1. The instantaneous rate of change equals:

$$\lim_{h \to 0} \frac{f(a + h) - f(a)}{a + h - a} \quad (5.29)$$

2. The instantaneous rate of change of f(x) at a is also commonly known as the derivative
of f(x) at a, that is, the slope of the tangent line of f(x) at a.


3. Given a function f(x) and a point a, the tangent line (Fig. 5.12) of f(x) at a is a line
that touches f(x) at a but does not cross f(x) (sufficiently close to a).


FIGURE 5.12: A Tangent line


5.3.3 Directed Acyclic Graphs 


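SOL-102 CH.SOL- 5.3.
1. The definition is:

$$f'(c) = \lim_{h \to 0} \frac{f(c + h) - f(c)}{h} \quad (5.30)$$

2. For $g(x) = \frac{1}{\sqrt{x}}$, the definition gives:

$$\begin{aligned}
g'(9) &= \lim_{h \to 0} \frac{1/\sqrt{9 + h} - 1/\sqrt{9}}{h} \\
&= \lim_{h \to 0} \frac{\sqrt{9} - \sqrt{9 + h}}{\sqrt{9} \cdot \sqrt{9 + h} \cdot h} \\
&= \lim_{h \to 0} \frac{(3 - \sqrt{9 + h})(3 + \sqrt{9 + h})}{3\sqrt{9 + h} \cdot (3 + \sqrt{9 + h}) \cdot h} \\
&= \lim_{h \to 0} \frac{9 - (9 + h)}{9\sqrt{9 + h} \cdot h + 3 \cdot (9 + h) \cdot h} \\
&= -\frac{1}{9 \cdot 3 + 3 \cdot 9} \\
&= -\frac{1}{54}
\end{aligned}$$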
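The limit can be cross-checked numerically by evaluating the difference quotient for shrinking h. The sketch assumes, as the derivation above suggests, that the function in question is g(x) = 1/√x; the helper name `difference_quotient` is hypothetical.

```python
import math

def g(x):
    return 1.0 / math.sqrt(x)

def difference_quotient(f, c, h):
    """The quotient whose limit as h -> 0 defines f'(c)."""
    return (f(c + h) - f(c)) / h

for h in (1e-1, 1e-3, 1e-5):
    print(h, difference_quotient(g, 9.0, h))
print(-1.0 / 54.0)
```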


SOL-103 UY CH.SOL- 5.4. 


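1. The function $g(x) = 2x^2 - x + 1$ is the one represented by the expression graph depicted in 5.4.

2. By the definition:

$$\begin{aligned}
g'(x) &= \lim_{h \to 0} \frac{g(x + h) - g(x)}{x + h - x} \\
&= \lim_{h \to 0} \frac{2(x + h)^2 - (x + h) + 1 - 2x^2 + x - 1}{h} \\
&= \lim_{h \to 0} \frac{2(x^2 + 2xh + h^2) - x - h + 1 - 2x^2 + x - 1}{h} \\
&= \lim_{h \to 0} \frac{2x^2 + 4xh + 2h^2 - x - h + 1 - 2x^2 + x - 1}{h} \\
&= \lim_{h \to 0} \frac{4xh + 2h^2 - h}{h} \\
&= \lim_{h \to 0} 4x + 2h - 1 \\
&= 4x - 1.
\end{aligned} \quad (5.31)$$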


5.3.4 The chain rule 


SOL-104 Y CH.SOL- 5.5. 


1. The chain rule states that the partial derivative of E = E(x, y) with respect to x can be
calculated via another variable y = y(x), as follows:

$$\frac{\partial E}{\partial x} = \frac{\partial E}{\partial y} \cdot \frac{\partial y}{\partial x} \quad (5.32)$$

2. For instance, the chain rule [8] is applied in neural networks to calculate the change in
the weights resulting from tuning the cost function. This derivative is calculated via a
chain of partial derivatives (e.g. of the activation functions).


5.3.5 Taylor series expansion 


SOL-105 Y CH.SOL- 5.6. 


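1.
$$\frac{1}{1 - x} = \sum_{n=0}^{\infty} x^n \quad (\text{when } -1 < x < 1) \quad (5.33)$$
2.
$$e^x = \sum_{n=0}^{\infty} \frac{x^n}{n!} = 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!} + \cdots \quad (5.34)$$
3.
$$\sin x = \sum_{n=0}^{\infty} \frac{(-1)^n}{(2n + 1)!} x^{2n+1} = x - \frac{x^3}{3!} + \frac{x^5}{5!} - \cdots \quad (5.35)$$
4.
$$\cos x = \sum_{n=0}^{\infty} \frac{(-1)^n}{(2n)!} x^{2n} = 1 - \frac{x^2}{2!} + \frac{x^4}{4!} - \cdots \quad (5.36)$$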


SOL-106 CH.SOL- 5.7.

$$\log x = \sum_{n=1}^{\infty} \frac{(-1)^{n+1} (x - 1)^n}{n} = (x - 1) - \frac{(x - 1)^2}{2} + \frac{(x - 1)^3}{3} - \frac{(x - 1)^4}{4} + \cdots \quad (5.37)$$

SOL-107 UY CH.SOL- 5.8. 
In this case, all derivatives can be computed: 


$$\begin{aligned}
f(x) &= 5x^2 - 11x + 1, & f(-3) &= 79, \\
f'(x) &= 10x - 11, & f'(-3) &= -41, \\
f''(x) &= 10, & f''(-3) &= 10, \\
f^{(n)}(x) &= 0, & \forall n \ge 3.
\end{aligned} \quad (5.38)$$
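Substituting these values into the Taylor formula centered at x = −3 gives the expansion itself. The final step, spelled out here as a worked sketch, reads:

```latex
f(x) = f(-3) + f'(-3)(x + 3) + \frac{f''(-3)}{2!}(x + 3)^2
     = 79 - 41(x + 3) + 5(x + 3)^2
```

Expanding the right-hand side recovers $5x^2 - 11x + 1$, so for a polynomial the Taylor series terminates and is exact.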


SOL-108 y CH.SOL- 5.9. 
The immediate answer is 1. Refer to eq. 5.36 to verify this logical consequence. m 


SOL-109 Ud CH.SOL- 5.10. 
By employing eq. 5.37, one can substitute x by 3 — x and generate the first 7 terms of the 
x-dependable outcome before assigning the point x = 1. 
a 


5.3.6 Limits and continuity 


SOL-110 CH.SOL- 5.11.
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1. With an indeterminate form 0/0, L’Hopital’s rule holds. We look at 


which equals to the original limit. 
2. Again, we obtain 0/0 at first, so we look at the quotient of the first order derivatives:

$$\lim_{x \to 0} \frac{2x e^{x^2} - 1}{-3\sin x - 1} = 1$$

The original limit is also equal to 1.


3. This time, the indeterminate form is ∞/∞ and L'Hopital's rule applies as well. The quotient
of the derivatives is

$$\frac{1 - \frac{1}{x}}{0.01\, x^{-99/100}} = 100 (x - 1) x^{-1/100}$$

As x → ∞, this goes to ∞, so the original limit is equal to ∞ also.
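A numerical evaluation supports both conclusions. Note the assumption: the two quotients below are reconstructed from the derivative quotients shown in this solution, namely (e^{x²} − x − 1)/(3 cos x − x − 3) for the second limit and (x − ln x)/(x^{1/100} + 4) for the third; the function names are hypothetical.

```python
import math

def ratio_2(x):
    """Assumed quotient for the second limit (x -> 0)."""
    return (math.exp(x * x) - x - 1.0) / (3.0 * math.cos(x) - x - 3.0)

def ratio_3(x):
    """Assumed quotient for the third limit (x -> infinity)."""
    return (x - math.log(x)) / (x ** 0.01 + 4.0)

for x in (1e-1, 1e-2, 1e-4):
    print(x, ratio_2(x))   # approaches 1 as x -> 0
for x in (1e2, 1e4, 1e8):
    print(x, ratio_3(x))   # grows without bound as x -> infinity
```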


5.3.7 Partial derivatives 


SOL-111 Y CH.SOL- 5.12. 
1. True. 


2. By treating y as constant, one can derive that

$$\frac{\partial g}{\partial x} = 2xy + y. \quad (5.39)$$


SOL-112 CH.SOL- 5.13.
1.
$$\nabla f(x, y) = \frac{\partial f}{\partial x}\,\mathbf{i} + \frac{\partial f}{\partial y}\,\mathbf{j} = (y^2 + 3x^2)\,\mathbf{i} + (2xy - 2y)\,\mathbf{j} \quad (5.40)$$

2. It can be shown that $\nabla g(x, y) = (2xy + y^2)\,\mathbf{i} + (x^2 + 2xy - 1)\,\mathbf{j}$ at (−1, 0) equals
(0, 0). According to the definition of the directional derivative:

$$(0, 0) \cdot \frac{(1, 1)}{|(1, 1)|} = 0 \quad (5.41)$$


SOL-113 Y CH.SOL- 5.14. 


$$\begin{aligned}
\frac{\partial f}{\partial x} &= 6\sin(x - y)\cos(x - y) \\
\frac{\partial f}{\partial y} &= -6\sin(x - y)\cos(x - y)
\end{aligned} \quad (5.42)$$


SOL-114 YY CH.SOL- 5.15. 


$$\begin{aligned}
\frac{\partial z}{\partial x} &= 2\cos x \sin y \\
\frac{\partial z}{\partial y} &= 2\sin x \cos y
\end{aligned} \quad (5.43)$$


5.3.8 Optimization 


SOL-115 CH.SOL- 5.16.


1. The function is only defined where x ≠ −2, i.e., on the domain
(−∞, −2) ∪ (−2, +∞).


2. By a simple quotient-rule differentiation:

$$f'(x) = \frac{2(x + 2)(2x - 1)}{(x + 2)^4}$$ (5.44)


Namely, except for the ill-defined x = −2, the critical point x = 0.5 should be
considered. For x > 0.5, the derivative is positive and the function increases, whereas
for −2 < x < 0.5 the derivative is negative and the function decreases.


3. The requested coordinate is (0.5, 0.2).
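The critical point can be sanity-checked numerically. The sketch below assumes that the function in question is f(x) = (x² + 1)/(x + 2)²; this form is not stated in the solution, but it is consistent with the derivative in eq. 5.44 and with the coordinate (0.5, 0.2). The helper `central_difference` is a hypothetical name introduced here.

```python
def f(x):
    # Assumed form of the function (consistent with eq. 5.44 and f(0.5) = 0.2).
    return (x**2 + 1) / (x + 2) ** 2


def f_prime(x):
    # Derivative as given in eq. 5.44.
    return 2 * (x + 2) * (2 * x - 1) / (x + 2) ** 4


def central_difference(g, x, h=1e-6):
    # Symmetric finite-difference estimate of g'(x).
    return (g(x + h) - g(x - h)) / (2 * h)


print(f(0.5), f_prime(0.5))
# The derivative changes sign from negative to positive at x = 0.5.
print(f_prime(0.0) < 0 < f_prime(1.0))
# The closed-form derivative agrees with the numerical estimate.
print(abs(central_difference(f, 1.0) - f_prime(1.0)) < 1e-6)
```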


SOL-116 CH.SOL- 5.17.


1. $f'(x) = 6x^2 - 1$, which entails that the behavior of the function changes around the points
$x = \pm\frac{1}{\sqrt{6}}$. The derivative is negative between $x = -\frac{1}{\sqrt{6}}$ and $x = \frac{1}{\sqrt{6}}$, i.e., the function decreases
in that interval, and increases otherwise.


2. The second derivative is $f''(x) = 12x$, which means the function is concave for negative
x values and convex otherwise.


SOL-117 CH.SOL- 5.18.
The function should be differentiated with respect to each variable separately and each
derivative equated to 0, as follows:

$$f_x(x, y) = 4x - y = 0, \qquad f_y(x, y) = -x + 2y = 0.$$

So, the solution to these equations yields the coordinate (0,0), and f(0,0) = 0.
Let us compute the second-order derivatives, as follows:

$$\frac{\partial^2 f}{\partial x^2}(x, y) = 4, \qquad \frac{\partial^2 f}{\partial y^2}(x, y) = 2, \qquad \frac{\partial^2 f}{\partial x \partial y}(x, y) = -1.$$

Also, the following relation exists:

$$D(x, y) = \frac{\partial^2 f}{\partial x^2}\,\frac{\partial^2 f}{\partial y^2} - \left(\frac{\partial^2 f}{\partial x \partial y}\right)^2 = 4 \cdot 2 - (-1)^2 = 7 > 0.$$

Thus, the critical point (0, 0) is a minimum.
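The classification of the critical point follows the usual second-derivative test, which can be sketched as below. The function name `second_derivative_test` is hypothetical, and the mixed partial is taken here to be −1, which is an assumption consistent with f_x = 4x − y.

```python
def second_derivative_test(fxx, fyy, fxy):
    # Discriminant D = f_xx * f_yy - f_xy^2 evaluated at the critical point.
    d = fxx * fyy - fxy ** 2
    if d > 0:
        kind = "minimum" if fxx > 0 else "maximum"
    elif d < 0:
        kind = "saddle point"
    else:
        kind = "inconclusive"
    return d, kind


# Second-order derivatives at (0, 0): f_xx = 4, f_yy = 2, f_xy = -1 (assumed).
print(second_derivative_test(4.0, 2.0, -1.0))
```

A positive discriminant together with a positive f_xx is what identifies the point as a minimum.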


5.3.9 The Gradient descent algorithm 


SOL-118 CH.SOL- 5.19.


1. It is the gradient of a function, which is mathematically represented by:

$$\nabla f(x, y) = \begin{bmatrix} \dfrac{\partial f(x, y)}{\partial x} \\ \dfrac{\partial f(x, y)}{\partial y} \end{bmatrix}$$ (5.45)

2. Increasing.


3. We will keep jumping between the same two points without ever reaching a minimum.


4. This phenomenon can be alleviated by using a learning rate or step size. For instance,
$x^{+} = x - \eta \nabla f(x)$, where η is a learning rate with a small value such as η = 0.25.


5. True.
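The jumping behaviour and its remedy can be illustrated on a toy problem. The sketch below uses the hypothetical objective f(x) = x² (not taken from the problem statement), whose gradient is 2x; with a unit step the iterate bounces between two points, while η = 0.25 shrinks it towards the minimum.

```python
def descend(grad, x0, eta, steps):
    # Repeatedly apply x <- x - eta * grad(x) and record every iterate.
    xs = [x0]
    for _ in range(steps):
        xs.append(xs[-1] - eta * grad(xs[-1]))
    return xs


def grad_f(x):
    # Gradient of the illustrative objective f(x) = x^2.
    return 2.0 * x


oscillating = descend(grad_f, 1.0, eta=1.0, steps=4)
converging = descend(grad_f, 1.0, eta=0.25, steps=4)
print(oscillating)
print(converging)
```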


SOL-119 CH.SOL- 5.20.


1. The point (12,12) has two classes, so the classes cannot be separated by any line.


2.

$$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} (\hat{y}_i - y_i)^2$$ (5.46)


3. A simple but fundamental algorithm for minimizing f. Just repeatedly move in the
direction of the negative gradient:
(a) Start with an initial guess $\theta^{(0)}$ and step size η
(b) For k = 1,2,3,...:
i. Compute the gradient $\nabla f(\theta^{(k-1)})$
ii. Check if the gradient is close to zero; if so stop, otherwise continue
iii. Update $\theta^{(k)} = \theta^{(k-1)} - \eta \nabla f(\theta^{(k-1)})$


(c) Return the final $\theta^{(k)}$ as the approximate solution $\theta^{*}$
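Steps (a) to (c) can be sketched in a few lines of plain Python. The function name `gradient_descent`, the tolerance, the iteration cap and the quadratic test objective are all assumptions chosen for illustration.

```python
import math


def gradient_descent(grad, theta0, eta=0.1, tol=1e-8, max_iter=10_000):
    theta = list(theta0)                      # (a) initial guess and step size
    for _ in range(max_iter):                 # (b) iterate
        g = grad(theta)                       #   i. compute the gradient
        if math.sqrt(sum(gi * gi for gi in g)) < tol:
            break                             #   ii. stop when it is close to zero
        theta = [t - eta * gi for t, gi in zip(theta, g)]  # iii. update
    return theta                              # (c) approximate minimizer


def grad_f(t):
    # Gradient of the hypothetical objective f(t) = (t0 - 3)^2 + 2 * (t1 + 1)^2.
    return [2.0 * (t[0] - 3.0), 4.0 * (t[1] + 1.0)]


theta_star = gradient_descent(grad_f, [0.0, 0.0])
print(theta_star)
```

The iteration cap guards against step sizes for which the gradient never becomes small enough to trigger the stopping test.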


5.3.10 The Backpropagation algorithm 


SOL-120 Y CH.SOL- 5.21. 


1. The annotated parts of equation (5.21) appear in (5.47):


$\sigma(x) = \frac{1}{1 + e^{-x}}$ = The Sigmoid activation function
$\sigma(x) \cdot (1 - \sigma(x))$ = The derivative of the Sigmoid activation function
Z = The input
dZ = The error introduced by input Z.
A = The output
dA = The error introduced by output A. (5.47)


2. Code snippet 5.13 provides an implementation of both the forward and backward passes 
for the sigmoid function. 


import numpy as np

class Sigmoid:
    def forward(self, x):
        # Cache the output: the derivative is expressed in terms of sigma(x).
        self.out = 1 / (1 + np.exp(-x))
        return self.out

    def backward(self, grad):
        # Chain rule: upstream gradient times sigma(x) * (1 - sigma(x)).
        grad_input = self.out * (1 - self.out) * grad
        return grad_input


FIGURE 5.13: Forward and backward passes for the sigmoid activation function in pure 
Python. 


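SOL-121 CH.SOL- 5.22.
The key concept in this question is merely understanding that the transfer function and
its derivatives change compared to traditional activation functions, namely:

$$\frac{\partial E}{\partial y_k} = (y_k - d_k)$$ (5.48)

$$\frac{\partial E}{\partial net_k} = (y_k - d_k) \cdot 2\cos(2\,net_k)$$ (5.49)

$$\Delta w_{jk} = -\eta \cdot (y_k - d_k) \cdot 2\cos(2\,net_k) \cdot y_j$$ (5.50)

$$\frac{\partial E}{\partial y_j} = \sum_k \left( \frac{\partial E}{\partial net_k} \cdot \frac{\partial net_k}{\partial y_j} \right) = \sum_k \left( \frac{\partial E}{\partial net_k} \cdot w_{jk} \right)$$ (5.51)

$$\frac{\partial E}{\partial net_j} = \frac{\partial E}{\partial y_j} \cdot \frac{\partial y_j}{\partial net_j} = \frac{\partial E}{\partial y_j} \cdot 3\,net_j^2$$ (5.52)

$$
\begin{aligned}
\Delta w_{ij} &= -\eta \frac{\partial E}{\partial w_{ij}} \\
&= -\eta \frac{\partial E}{\partial net_j} \cdot \frac{\partial net_j}{\partial w_{ij}} \\
&= -\eta \cdot \left( \sum_k \left[ (y_k - d_k) \cdot 2\cos(2\,net_k) \cdot w_{jk} \right] \right) \cdot 3\,net_j^2 \cdot y_i
\end{aligned}
$$ (5.53)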
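These expressions can be verified with a finite-difference gradient check. The sketch below assumes the activations implied by the derivatives, namely y_k = sin(2 net_k) at the output and y_j = net_j³ in the hidden layer, together with the loss E = ½ Σ (y_k − d_k)²; the network size, weights and inputs are made up for illustration.

```python
import math


def forward(w_ij, w_jk, y_i):
    # Hidden layer: net_j = sum_i w_ij * y_i, y_j = net_j^3 (assumed activation).
    net_j = [sum(w_ij[i][j] * y_i[i] for i in range(len(y_i)))
             for j in range(len(w_ij[0]))]
    y_j = [n ** 3 for n in net_j]
    # Output layer: net_k = sum_j w_jk * y_j, y_k = sin(2 * net_k) (assumed).
    net_k = [sum(w_jk[j][k] * y_j[j] for j in range(len(y_j)))
             for k in range(len(w_jk[0]))]
    y_k = [math.sin(2 * n) for n in net_k]
    return net_j, y_j, net_k, y_k


def loss(w_ij, w_jk, y_i, d):
    y_k = forward(w_ij, w_jk, y_i)[3]
    return 0.5 * sum((y - t) ** 2 for y, t in zip(y_k, d))


def analytic_grads(w_ij, w_jk, y_i, d):
    net_j, y_j, net_k, y_k = forward(w_ij, w_jk, y_i)
    # dE/dnet_k, as in eq. 5.49.
    delta_k = [(y_k[k] - d[k]) * 2 * math.cos(2 * net_k[k]) for k in range(len(d))]
    g_jk = [[delta_k[k] * y_j[j] for k in range(len(d))] for j in range(len(y_j))]
    # dE/dnet_j, as in eqs. 5.51 and 5.52.
    delta_j = [sum(delta_k[k] * w_jk[j][k] for k in range(len(d))) * 3 * net_j[j] ** 2
               for j in range(len(y_j))]
    g_ij = [[delta_j[j] * y_i[i] for j in range(len(y_j))] for i in range(len(y_i))]
    return g_ij, g_jk


def numeric_grad(w, i, j, f, h=1e-6):
    # Central difference with respect to a single weight, restored afterwards.
    old = w[i][j]
    w[i][j] = old + h
    up = f()
    w[i][j] = old - h
    down = f()
    w[i][j] = old
    return (up - down) / (2 * h)


w_ij = [[0.3, -0.2], [0.5, 0.8]]
w_jk = [[0.7], [-0.6]]
y_i, d = [1.0, 0.5], [0.2]
g_ij, g_jk = analytic_grads(w_ij, w_jk, y_i, d)
f = lambda: loss(w_ij, w_jk, y_i, d)
max_err = max(
    max(abs(g_ij[i][j] - numeric_grad(w_ij, i, j, f)) for i in range(2) for j in range(2)),
    max(abs(g_jk[j][0] - numeric_grad(w_jk, j, 0, f)) for j in range(2)),
)
print(max_err)
```

The weight updates of eqs. 5.50 and 5.53 are these gradients multiplied by −η.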


5.3.11 Feed forward neural networks 


5.3.12 Activation functions, Autograd/JAX 


SOL-122 Y CH.SOL- 5.23. 


1. True. 
2. True. 


SOL-123 CH.SOL- 5.24.
The answers are as follows:


1. $h_\theta(x) = g(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}}$.


2. The decision boundary for the logistic sigmoid function is where $h_\theta(x) = 0.5$ (values
less than 0.5 mean false, values equal to or more than 0.5 mean true).


3. That there is an 80% chance that the instance is of the corresponding class, therefore:
• $h_\theta(x) = g(\theta_0 + \theta_1 x_1 + \theta_2 x_2)$. We can predict y = 1 if $\theta_0 + \theta_1 x_1 + \theta_2 x_2 \ge 0$.


4. The code snippet in 5.14 implements the function using Autograd.
import torch
from torch.autograd import Function

class Sigmoid(Function):
    @staticmethod
    def forward(ctx, x):
        output = 1 / (1 + torch.exp(-x))
        # Save the output; the backward pass only needs sigma(x).
        ctx.save_for_backward(output)
        return output

    @staticmethod
    def backward(ctx, grad_output):
        output, = ctx.saved_tensors
        grad_x = output * (1 - output) * grad_output
        return grad_x


FIGURE 5.14: Forward and backward for the sigmoid function in Autograd. 


5. The code snippet in 5.15 implements the function using Autograd.
import torch
from torch.autograd import Function

class ReLU(Function):
    @staticmethod
    def forward(ctx, input):
        ctx.save_for_backward(input)
        return input.clamp(min=0)

    @staticmethod
    def backward(ctx, grad_output):
        input, = ctx.saved_tensors
        grad_input = grad_output.clone()
        # The gradient is zero wherever the input was negative.
        grad_input[input < 0] = 0
        return grad_input


FIGURE 5.15: Forward and backward for the ReLU function in Autograd. 


SOL-124 Y CH.SOL- 5.25. The answers are as follows: 


1. Code snippet 5.16 implements the forward pass using pure Python.
import numpy as np
import torch

xT = torch.abs(torch.tensor([[0.37, 0.192, 0.571]],
    requires_grad=True)).type(torch.DoubleTensor)
xT_np = xT.detach().cpu().numpy()
print("Tensor:\n", xT_np)
arctanh_values = np.arctanh(xT_np)
print("Numpy:", arctanh_values)
> Numpy: [[0.38842311 0.1944129 0.64900533]]


FIGURE 5.16: Forward pass for equation (5.23) using pure Python. 


2. Code snippet 5.17 implements the forward pass using Autograd. 


import torch
from torch.autograd import Function

class ArtanhFunction(Function):
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        # artanh(x) = 0.5 * (log(1 + x) - log(1 - x))
        r = (torch.log(1 + x) - torch.log(1 - x)) * 0.5
        return r


FIGURE 5.17: Forward pass for equation (5.23). 


3. Code snippet 5.18 implements the backward pass using Autograd.
from torch.autograd import Function

class ArtanhFunction(Function):
    @staticmethod
    def backward(ctx, grad_output):
        input, = ctx.saved_tensors
        # d/dx artanh(x) = 1 / (1 - x^2)
        out = grad_output / (1 - input ** 2)
        print("backward:{}".format(out))
        return out


FIGURE 5.18: Backward pass for equation (5.23). 


4. Code snippet 5.19 verifies the correctness of the implementation using gradcheck. 


import numpy as np
import torch

xT = torch.abs(torch.tensor([[0.37, 0.192, 0.571]],
    requires_grad=True)).type(torch.DoubleTensor)
arctanh_values_torch = arctanhPyTorch(xT)
print("Torch:", arctanh_values_torch)

from torch.autograd import gradcheck, Variable
f = ArtanhFunction.apply
test = gradcheck(lambda t: f(t), xT)
print(test)

> PyTorch version: 1.7.0
> Torch: tensor([[0.3884, 0.1944, 0.6490]], dtype=torch.float64,
> grad_fn=<ArtanhFunctionBackward>)
> backward:tensor([[...]], dtype=torch.float64,
grad_fn=<CopyBackwards>)


FIGURE 5.19: Invoking arctanh using gradcheck


5.3.13 Dual numbers in AD


SOL-125 & CH.SOL- 5.26. 
The answers are as follows: 


1. The procedure of AD is to take the verbatim text of a computer program which calculates
a numerical value and to transform it into the text of another computer program, called the
transformed program, which calculates the desired derivative values. The transformed
computer program carries out these derivative calculations by repeated use of the chain
rule, applied however to actual floating-point values rather than to a symbolic
representation.


2. Dual numbers extend all numbers by adding a second component $x \mapsto x + \dot{x}d$, where
$\dot{x}$ is the dual part.


3. The following arithmetic operations are possible on DN:
(a) $d^2 = 0$
(b) $(x + \dot{x}d) + (y + \dot{y}d) = x + y + (\dot{x} + \dot{y})d$
(c) $-(x + \dot{x}d) = -x - \dot{x}d$
(d) $\frac{1}{x + \dot{x}d} = \frac{1}{x} - \frac{\dot{x}}{x^2}d$


4. For $f(x + \dot{x}d)$ the Taylor series expansion is:

$$f(x + \dot{x}d) = f(x) + \frac{f'(x)}{1!}\dot{x}d + \ldots 0$$ (5.54)


The important result is that all higher-order terms (n ≥ 2) disappear,
which provides a closed-form mathematical expression that represents a function and its
derivative.
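The rules above translate almost directly into code. The following is a minimal sketch of a dual-number type in plain Python; the class name `Dual`, its attribute names and the helper `sin` are illustrative choices, and only addition, negation, multiplication and the sine are covered.

```python
import math


class Dual:
    """A dual number x + x_dot * d, where d**2 == 0."""

    def __init__(self, real, dual=0.0):
        self.real, self.dual = real, dual

    @staticmethod
    def _lift(v):
        # Treat plain numbers as dual numbers with a zero dual part.
        return v if isinstance(v, Dual) else Dual(v)

    def __add__(self, other):
        o = Dual._lift(other)
        return Dual(self.real + o.real, self.dual + o.dual)

    __radd__ = __add__

    def __neg__(self):
        return Dual(-self.real, -self.dual)

    def __mul__(self, other):
        o = Dual._lift(other)
        # (x + x'd)(y + y'd) = xy + (x y' + x' y)d, because d^2 = 0.
        return Dual(self.real * o.real, self.real * o.dual + self.dual * o.real)

    __rmul__ = __mul__


def sin(z):
    # First-order Taylor expansion: sin(x + x'd) = sin(x) + cos(x) x' d.
    return Dual(math.sin(z.real), math.cos(z.real) * z.dual)


def g(x):
    return 5 * x * x + 4 * x + 1


out = g(Dual(5.0, 1.0))  # seed the dual part with 1 to obtain dg/dx
print(out.real, out.dual)
```

Evaluating g on a dual number seeded with a unit dual part returns the function value in the real part and the derivative in the dual part in a single pass.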


SOL-126 CH.SOL- 5.27.
The answers are as follows:


1.

$$\sin(x + \dot{x}d) = \sin(x) + \cos(x)\dot{x}d$$ (5.55)


2. If we traverse the graph 5.9 from left to right we derive the following simple function:

$$g(x) = 3 * x + 2$$ (5.56)

3. We know that:

$$g(x) = 3 * x + 2$$ (5.57)
$$g'(x) = 3$$ (5.58)

Now if we expand the function using DN:

$$g(x + \dot{x}d) = 3 * (x + \dot{x}d) + 2 =$$ (5.59)
$$3 * x + 3 * (\dot{x}d) + 2$$ (5.60)

Rearranging:

$$3 * x + 2 + 3 * (\dot{x}d)$$ (5.61)

But since g(x) = 3 * x + 2 then:

$$g(x + \dot{x}d) = g(x) + g'(x)\dot{x}d$$ (5.62)

4. Evaluating the function g(x) at x = 2 using DN we get:

$$g(x = 2) = (3 * 2 + 2) + (3)\dot{x}d =$$ (5.63)
$$8 + (3)\dot{x}d$$ (5.64)


5. The code snippet in 5.20 implements the function using Autograd.
import autograd.numpy as np
from autograd import grad

x = np.array([2.0], dtype=float)

def f1(x):
    return 3*x + 2

grad_f1 = grad(f1)
print(f1(x))
print(grad_f1(x))


FIGURE 5.20: Autograd 


SOL-127 Y CH.SOL- 5.28. The answers are as follows: 


1. If we traverse the graph 5.9 from left to right we derive the following function:

$$g(x) = 5 * x^2 + 4 * x + 1$$ (5.65)

2. We know that:

$$g(x) = 5 * x^2 + 4 * x + 1$$ (5.66)
$$g'(x) = 10 * x + 4$$ (5.67)

Now if we expand the function using DN we get:

$$g(x + \dot{x}d) = 5 * (x + \dot{x}d)^2 + 4 * (x + \dot{x}d) + 1 =$$ (5.68)
$$5 * (x^2 + 2 * x * \dot{x}d + (\dot{x}d)^2) + 4 * x + 4 * (\dot{x}d) + 1$$ (5.69)

However, by definition $d^2 = 0$ and therefore that term vanishes. Rearranging the
terms:

$$(5 * x^2 + 4 * x + 1) + (10 * x + 4)\dot{x}d$$ (5.70)

But since $g(x) = (5 * x^2 + 4 * x + 1)$ then:

$$g(x + \dot{x}d) = g(x) + g'(x)\dot{x}d$$ (5.71)

3. Evaluating the function g(x) at x = 5 using DN we get:

$$g(x = 5) = (5 * 5^2 + 4 * 5 + 1) + (10 * 5 + 4)\dot{x}d = 146 + (54)\dot{x}d$$ (5.72)


4. The code snippet in 5.21 implements the function using Autograd.


import autograd.numpy as np
from autograd import grad

x = np.array([5.0], dtype=float)

def f1(x):
    return 5*x**2 + 4*x + 1

grad_f1 = grad(f1)
print(f1(x))       # > 146.0
print(grad_f1(x))  # > 54.0


FIGURE 5.21: Autograd 


5.3.14 Forward mode AD 


SOL-128 CH.SOL- 5.29.


The answers are as follows:
1. The function g(x) represented by the expression graph in 5.11 is:

$$g(x) = A + B * \ln(C)$$ (5.73)

2. For a logarithmic function:

$$\frac{d}{dx}\ln(x) = \frac{1}{x}$$ (5.74)


CET (5.75) 


SOL-129 @ CH.SOL- 5.30. The answers are as follows: 


1. True. Both directions yield the exact same results. 

2. True. Reverse mode is more efficient than forward mode AD (why?). 
3. True. 

4. True. 


SOL-130 YY CH.SOL- 5.31. 
The answers are as follows: 


1. The function is

$$f(x_1, x_2) = x_1 x_2 + \ln(x_1)$$ (5.76)


2. The graph associated with the forward mode AD is as follows:


FIGURE 5.22: A computation graph for g(x1, x2) in 5.1


3. The partial derivatives are:

$$\frac{\partial f}{\partial x_1} = x_2 + \frac{1}{x_1}, \qquad \frac{\partial f}{\partial x_2} = x_1$$ (5.77)

5.3.15 Forward mode AD table construction 


SOL-131 YY CH.SOL- 5.32. 
The answers are as follows: 


1. The graph with the intermediate values is depicted in (5.23)
FIGURE 5.23: A derivative graph for g(x1, x2) in 5.1


2. Forward mode AD for $g(x_1, x_2) = \ln(x_1) + x_1 x_2$ evaluated at $(x_1, x_2) = (e^2, \pi)$.


Forward-mode function evaluation

$v_{-1} = x_1 = e^2$
$v_0 = x_2 = \pi$
$v_1 = \ln v_{-1} = \ln(e^2) = 2$
$v_2 = v_{-1} \times v_0 = e^2 \times \pi = 23.2134$
$v_3 = v_1 + v_2 \approx 2 + 23.2134 = 25.2134$
$f = v_3 \approx 25.2134$

TABLE 5.1: Forward-mode AD table for $y = g(x_1, x_2) = \ln(x_1) + x_1 x_2$ evaluated at $(x_1, x_2) = (e^2, \pi)$
and setting $\dot{x}_1 = 1$ to compute $\frac{\partial y}{\partial x_1}$


3. The following Python code (5.24) confirms that the numerical results are correct:
import math
print(math.log(math.e*math.e) + math.e*math.e*math.pi)
> 25.2134


FIGURE 5.24: Python code- AD of the function g(21, x2) 


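4. Seed values are the values to which the dependent and independent variables are
initialized before being propagated in a computation graph. For instance:

$$\dot{x}_1 = \frac{\partial x_1}{\partial x_1} = 1, \qquad \dot{x}_2 = \frac{\partial x_2}{\partial x_1} = 0$$

Therefore we set $\dot{x}_1 = 1$ to compute $\frac{\partial y}{\partial x_1}$.


5. Here we construct a table for the forward-mode AD for the derivative of $f(x_1, x_2) = \ln(x_1) + x_1 x_2$
evaluated at $(x_1, x_2) = (e^2, \pi)$ while setting $\dot{x}_1 = 1$ to compute $\frac{\partial y}{\partial x_1}$. In
forward-mode AD a derivative is called a tangent.


In the derivation that follows, note that mathematically, using manual differentiation:

$$
\begin{aligned}
\frac{\partial}{\partial x_1}\left[\ln(x_1) + x_1 x_2\right]
&= \frac{\partial}{\partial x_1}\left[\ln(x_1)\right] + x_2 \cdot \frac{\partial}{\partial x_1}\left[x_1\right] \\
&= \frac{1}{x_1} + x_2 \cdot 1 \\
&= \frac{1}{x_1} + x_2
\end{aligned}
$$

and also, since $\frac{d}{dx}\ln(x) = \frac{1}{x}$, then $\dot{v}_1 = \frac{1}{v_{-1}} \cdot \dot{v}_{-1} = \dot{v}_{-1}/v_{-1} = \frac{1}{e^2} \cdot 1 = 1/e^2$.

Forward-mode AD derivative evaluation

$v_{-1} = x_1 = e^2$
$v_0 = x_2 = \pi$
$\dot{v}_{-1} = \dot{x}_1 = 1$
$\dot{v}_0 = \dot{x}_2 = 0$
$\dot{v}_1 = \dot{v}_{-1}/v_{-1} = 1/e^2$
$\dot{v}_2 = \dot{v}_{-1} \times v_0 + \dot{v}_0 \times v_{-1} = 1 \times \pi + 0 \times e^2 = \pi$
$\dot{v}_3 = \dot{v}_1 + \dot{v}_2 = 1/e^2 + \pi$

TABLE 5.3: Forward-mode AD table for $y = g(x_1, x_2) = \ln(x_1) + x_1 x_2$ evaluated at $(x_1, x_2) = (e^2, \pi)$
and setting $\dot{x}_1 = 1$ (seed values are mentioned here: 3) to compute $\frac{\partial y}{\partial x_1}$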
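The two tables can be reproduced by carrying each primal value and its tangent side by side. The sketch below uses only the standard library; the function name `forward_mode` and its variable names (which mirror v₋₁, v₀, v₁, v₂, v₃) are illustrative.

```python
import math


def forward_mode(x1, x2, x1_dot, x2_dot):
    # Each line evaluates one node of the graph together with its tangent.
    v_m1, v_m1_dot = x1, x1_dot
    v_0, v_0_dot = x2, x2_dot
    v_1, v_1_dot = math.log(v_m1), v_m1_dot / v_m1
    v_2, v_2_dot = v_m1 * v_0, v_m1_dot * v_0 + v_0_dot * v_m1
    v_3, v_3_dot = v_1 + v_2, v_1_dot + v_2_dot
    return v_3, v_3_dot


# Seed (1, 0) yields dy/dx1; seed (0, 1) yields dy/dx2.
y, dy_dx1 = forward_mode(math.e ** 2, math.pi, 1.0, 0.0)
_, dy_dx2 = forward_mode(math.e ** 2, math.pi, 0.0, 1.0)
print(round(y, 4), round(dy_dx1, 4), round(dy_dx2, 4))
```

One forward sweep is needed per input variable, which is why forward mode becomes costly for functions with many inputs.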


6. The following Python code (5.25) confirms that the numerical results are correct:
import autograd.numpy as np
from autograd import grad
import math

x1 = math.e * math.e
x2 = math.pi

def f1(x1, x2):
    return (np.log(x1) + x1*x2)

# grad differentiates with respect to the first argument (x1) by default.
grad_f1 = grad(f1)

print(f1(x1, x2))       # > 25.2134
print(grad_f1(x1, x2))  # > 3.2769


FIGURE 5.25: Python code- AD of the function g(x1, x2) 


5.3.16 Symbolic differentiation 


5.3.17 Simple differentiation 


SOL-132 y CH.SOL- 5.33. 
The answers are as follows: 


1. Approximate methods such as numerical differentiation suffer from numerical instability
and truncation errors.


2. In symbolic differentiation, a symbolic expression for the derivative of a function is
calculated. This approach is quite slow and requires symbol parsing and manipulation.
For example, the number √2 is represented in SymPy as the object Pow(2,1/2). Since
SymPy employs exact representations, Pow(2,1/2)*Pow(2,1/2) will always equal 2.

SOL-133 CH.SOL- 5.34.


1. First: 


import sympy
sympy.init_printing()
from sympy import Symbol
from sympy import diff, exp, sqrt

y = sympy.Symbol("y")
sigmoid = 1/(1+sympy.exp(-y))

FIGURE 5.26: Sigmoid in SymPy 
2. Second: 


sig_der = sympy.diff(sigmoid, y)


FIGURE 5.27: Sigmoid gradient in SymPy 


3. Third: 


sig_der.evalf(subs={y:0})
> 0.25


FIGURE 5.28: Sigmoid gradient in SymPy


4. The plot is depicted in 5.29.


p = sympy.plot(sig_der)


FIGURE 5.29: SymPy gradient of the Sigmoid() function 


5.3.18 The Beta-Binomial model 


SOL-134 CH.SOL- 5.35.
To correctly render the generated LaTeX in this problem, we import and configure several
libraries as depicted in 5.30.

import numpy as np
import scipy.stats as st
import matplotlib.pyplot as plt
import sympy as sp
sp.interactive.printing.init_printing(use_latex=True)
from IPython.display import display, Math, Latex
maths = lambda s: display(Math(s))
latex = lambda s: display(Latex(s))


FIGURE 5.30: SymPy imports 


1. The Likelihood function can be created as follows. Note the specific details of generating
the Factorial function in SymPy.

n = sp.Symbol('n', integer=True, positive=True)
r = sp.Symbol('r', integer=True, positive=True)
theta = sp.Symbol('theta')
# Create the function symbolically
from sympy import factorial
cNkSym = (factorial(n))/(factorial(r) * factorial(n-r))
cNkSym.evalf()
binomSym = cNkSym * ((theta**r) * (1-theta)**(n-r))
binomSym.evalf()

# Convert it to a Numpy-callable function
binomLambda = sp.Lambda((theta, r, n), binomSym)
maths(r"\operatorname{Bin}(r|n,\theta) = ")
display(binomLambda.expr)
# Evaluating the SymPy version results in:
> binomSym.subs({theta:0.5,r:50,n:100}) = 0.07958923
# Evaluating the pure Numpy version results in:
> binomLambda(0.5,50,100) = 0.07958923


FIGURE 5.31: Likelihood function using SymPy 


The symbolic representation results in the following LaTeX:

$$\operatorname{Bin}(r|n, \theta) = \frac{\theta^{r} (-\theta + 1)^{n - r}\, n!}{r!\,(n - r)!}$$ (5.78)


2. The Beta distribution can be created as follows.

a = sp.Symbol('a', integer=False, positive=True)
b = sp.Symbol('b', integer=False, positive=True)
mu = sp.Symbol('mu', integer=False, positive=True)
# Create the function symbolically
G = sp.gamma
# The normalisation factor
BetaNormSym = G(a + b)/(G(a)*G(b))
# The functional form
BetaFSym = theta**(a-1) * (1-theta)**(b-1)
BetaSym = BetaNormSym * BetaFSym
BetaSym.evalf()  # this works
# Turn Beta into a function
BetaLambda = sp.Lambda((theta,a,b), BetaNormSym * BetaFSym)
maths(r"\operatorname{Beta}(\theta|a,b) = ")
display(BetaSym)
# Evaluating the SymPy version results in:
> BetaLambda(0.5,2,7) = 0.4375
# Evaluating the pure Numpy version results in:
> BetaSym.subs({theta:0.5,a:2,b:7}) = 0.4375

FIGURE 5.32: Beta distribution using SymPy 


The result is:

$$\operatorname{Beta}(\theta|a, b) = \frac{\theta^{a-1} (-\theta + 1)^{b-1}\, \Gamma(a + b)}{\Gamma(a)\,\Gamma(b)}$$ (5.79)


3. The plot is depicted in 5.33.

%pylab inline
mus = arange(0,1,.01)
# Plot for various values of a and b
for ab in [(.1,.1), (.5,.5), (2,20), (2,3), (1,1)]:
    plot(mus, vectorize(BetaLambda)(mus, *ab), label="a=%s b=%s" % ab)
legend(loc=0)
xlabel(r"$\theta$", size=22)


FIGURE 5.33: A plot of the Beta distribution 


4. We can find the posterior distribution by multiplying our Beta prior by the Binomial
Likelihood.

a = sp.Symbol('a', integer=False, positive=True)
b = sp.Symbol('b', integer=False, positive=True)
BetaBinSym = BetaSym * binomSym
# Turn the Beta-Binomial product into a function
BetaBinLambda = sp.Lambda((theta,a,b,n,r), BetaBinSym)
BetaBinSym = BetaBinSym.powsimp()
display(BetaBinSym)
maths(r"\operatorname{Beta}(\theta|a,b) \times \operatorname{Bin}(r|n,\theta) \propto %s" % sp.latex(BetaBinSym))
> BetaBinSym.subs({theta:0.5,a:2,b:7,n:10,r:3}) = 0.051269
> BetaBinLambda(0.5,2,7,10,3) = 0.051269


FIGURE 5.34: A plot of the Beta distribution 


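The result is:

$$\operatorname{Beta}(\theta|a, b) \times \operatorname{Bin}(r|n, \theta) \propto \frac{\theta^{a + r - 1} (-\theta + 1)^{b + n - r - 1}\, n!\, \Gamma(a + b)}{r!\,(n - r)!\,\Gamma(a)\,\Gamma(b)}$$

So the posterior distribution has the same functional dependence on θ as the prior; it is
just another Beta distribution.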
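This conjugacy can be checked numerically without SymPy. The sketch below, using only the standard library, verifies that the product of the Beta(a, b) prior and the Binomial likelihood is a constant multiple of the Beta(a + r, b + n − r) density; the helper names `beta_pdf` and `binom_pmf` are illustrative, and the parameter values mirror those used in the evaluation above.

```python
import math


def beta_pdf(theta, a, b):
    # Beta density with the Gamma-function normalisation factor.
    norm = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    return norm * theta ** (a - 1) * (1 - theta) ** (b - 1)


def binom_pmf(r, n, theta):
    # Binomial likelihood of r successes in n trials.
    return math.comb(n, r) * theta ** r * (1 - theta) ** (n - r)


a, b, n, r = 2, 7, 10, 3
product_at_half = beta_pdf(0.5, a, b) * binom_pmf(r, n, 0.5)
# If the posterior is Beta(a + r, b + n - r), this ratio must not depend on theta.
ratios = [
    beta_pdf(t, a, b) * binom_pmf(r, n, t) / beta_pdf(t, a + r, b + n - r)
    for t in (0.2, 0.5, 0.8)
]
print(product_at_half)
print(ratios)
```

The constant ratio is the normalising factor that turns the unnormalised product into a proper posterior density.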


5. Mathematically, the relationship is as follows:

Prior:

$$\operatorname{Beta}(\theta|a = 2, b = 7) = 56\theta(-\theta + 1)^{6}$$

Likelihood:

$$\operatorname{Bin}(r = 3|n = 50, \theta) = 19600\theta^{3}(-\theta + 1)^{47}$$ (5.80)

Posterior (normalised):

$$\operatorname{Beta}(\theta|2, 7) \times \operatorname{Bin}(3|50, \theta) = 1097600\theta^{4}(-\theta + 1)^{53}$$


prior = BetaLambda(theta,2,7)
maths(r"\mathbf{Prior}:\operatorname{Beta}(\theta|a=2,b=7) = %s" % sp.latex(prior))
likelihood = binomLambda(theta,3,50)  # = binomLambda(0.5,3,10)
maths(r"\mathbf{Likelihood}: \operatorname{Bin}(r=3|n=50,\theta) = %s" % sp.latex(likelihood))
posterior = prior * likelihood
posterior = posterior.powsimp()
maths(r"\mathbf{Posterior (normalised)}:\operatorname{Beta}(\theta|2,7) \times \operatorname{Bin}(3|50,\theta)=%s" % sp.latex(posterior))
posterior.subs({theta:0.5})
plt.plot(mus, (sp.lambdify(theta,posterior))(mus), 'r')
plt.xlabel(r"$\theta$", size=22)


FIGURE 5.35: A plot of the Posterior with the provided data samples.
CHAPTER


DEEP LEARNING: NN ENSEMBLES 


The saddest aspect of life right now is that science gathers knowledge faster than society
gathers wisdom.


— Isaac Asimov


6.1 Introduction 


Intuition and practice demonstrate that a poor or an inferior choice may
be altogether prevented merely by motivating a group (or an ensemble)
of people with diverse perspectives to make a mutually acceptable choice.
Likewise, in many cases, neural network ensembles significantly improve
the generalization ability of single-model based AI systems [5, 11]. Shortly
following the foundation of Kaggle, research in the field started blooming; not only
because researchers are advocating and using advanced ensembling approaches in
almost every competition, but also because of the empirical success of the top winning
models. Though the whole process of training ensembles typically involves the utilization
of dozens of GPUs and prolonged training periods, ensembling approaches enhance
the predictive power of a single model. Though ensembling obviously has a significant
impact on the performance of AI systems in general, research shows its effect is
particularly dramatic in the field of neural networks [Russakovsky_2015, 1, 4, 7, 13].
Therefore, while we could examine combinations of any type of learning algorithms,
the focus of this chapter is the combination of neural networks.


6.2 Problems 
6.2.1 Bagging, Boosting and Stacking 


PRB-135 CH.PRB- 6.1.
Mark all the approaches which can be utilized to boost a single model performance:


(i) Majority Voting
(ii) Using K-identical base-learning algorithms
(iii) Using K-different base-learning algorithms
(iv) Using K-different data-folds
(v) Using K-different random number seeds
(vi) A combination of all the above approaches

PRB-136 CH.PRB- 6.2.

An argument erupts between two senior data-scientists regarding the choice of an
approach for training on a very small medical corpus. One suggests that bagging is superior
while the other suggests stacking. Which technique, bagging or stacking, in your opinion is
superior? Explain in detail.


(i) Stacking since each classifier is trained on all of the available data.


(ii) Bagging since we can combine as many classifiers as we want by training each on a
different sub-set of the training corpus.


PRB-137 CH.PRB- 6.3.
Complete the sentence: A random forest is a type of a decision tree which utilizes
[bagging/boosting]


PRB-138 CH.PRB- 6.4.
The algorithm depicted in Fig. 6.1 was found in an old book about ensembling. Name the
algorithm.

Algorithm 1: Algo 1
Data: A set of training data, Q, with N elements has been established
while K times do
    Create a random subset of N' data by sampling from Q containing the N samples;
    N' < N;
    Execute algorithm Algo 2;
    Return all N' back to Q

Algorithm 2: Algo 2
Choose a learner hm;
while K times do
    Pick a training set and train with hm;


FIGURE 6.1: A specific ensembling approach


PRB-139 CH.PRB- 6.5.
Fig. 6.2 depicts a part of a specific ensembling approach applied to the models x1, x2, ..., xp.
In your opinion, which approach is being utilized?


FIGURE 6.2: A specific ensembling approach (diagram labels: Generalizer, Base Learners)


(i) Bootstrap aggregation
(ii) Snapshot ensembling
(iii) Stacking
(iv) Classical committee machines


PRB-140 CH.PRB- 6.6.
Consider a training corpus consisting of balls which are glued together as triangles, each
of which has either 1, 3, 6, 10, 15, 21, 28, 36, or 45 balls.


1. We draw several samples from this corpus as presented in Fig. 6.3 wherein each sample
is equiprobable. What type of sampling approach is being utilized here?


FIGURE 6.3: Sampling approaches


(i) Sampling without replacement
(ii) Sampling with replacement


2. Two samples are drawn one after the other. In which of the following cases is the
covariance between the two samples equal to zero?


(i) Sampling without replacement
(ii) Sampling with replacement


3. During training, the corpus is sampled with replacement and is divided into several
folds (T1 to T4) as presented in Fig. 6.4.


FIGURE 6.4: Sampling approaches


If 10 balls glued together is a sample event that we know is hard to correctly classify,
then it is impossible that we are using:


(i) Bagging
(ii) Boosting

6.2.2 Approaches for Combining Predictors 


PRB-141 CH.PRB- 6.7.


There are several methods by which the outputs of base classifiers can be combined to
yield a single prediction. Fig. 6.5 depicts part of a specific ensembling approach applied to
several CNN model predictions for a labelled data-set. Which approach is being utilized?


(i) Majority voting for binary classification
(ii) Weighted majority voting for binary classification
(iii) Majority voting for class probabilities
(iv) Weighted majority class probabilities
(v) An algebraic weighted average for class probabilities
(vi) An adaptive weighted majority voting for combining multiple classifiers


import numpy as np
import pandas as pd

# filelist holds the paths of the per-model prediction CSV files.
l = []
for i, f in enumerate(filelist):
    temp = pd.read_csv(f)
    l.append(temp)
arr = np.stack(l, axis=-1)
avg_results = pd.DataFrame(arr[:, :-1, :].mean(axis=2))
avg_results['image'] = l[0]['image']
avg_results.columns = l[0].columns


FIGURE 6.5: PyTorch code snippet for an ensemble


PRB-142 CH.PRB- 6.8.

Read the paper Neural Network Ensembles [3] and then complete the sentence: If the
average error rate for a specific instance in the corpus is less than [...]% and the respective
classifiers in the ensemble produce independent [...], then when the number of classifiers
combined approaches infinity, the expected error can be diminished to zero.


PRB-143 CH.PRB- 6.9.
True or false: A perfect ensemble consists of highly correct classifiers that differ as
much as possible.


PRB-144 CH.PRB- 6.10.

True or false: In bagging, we re-sample the training corpus with replacement and
therefore this may lead to some instances being represented numerous times while other
instances are not represented at all.


6.2.3 Monolithic and Heterogeneous Ensembling 


PRB-145 CH.PRB- 6.11.


1. True or false: Training an ensemble of a single monolithic architecture results in 
lower model diversity and possibly decreased model prediction accuracy. 


2. True or false: The generalization accuracy of an ensemble increases with the number 
of well-trained models it consists of. 


3. True or false: Bootstrap aggregation (or bagging), refers to a process wherein a CNN 
ensemble is being trained using a random subset of the training corpus. 


4. True or false: Bagging assumes that if the single predictors have independent errors, 
then a majority vote of their outputs should be better than the individual predictions. 


PRB-146 @ CH.PRB- 6.12. 

Refer to the papers: Dropout as a Bayesian Approximation [2] and Can You Trust 
Your Model’s Uncertainty? [12] and answer the following question: Do deep ensembles 
achieve a better performance on out-of-distribution uncertainty benchmarks compared with 
Monte-Carlo (MC)-dropout? 


PRB-147 @ CH.PRB- 6.13. 


1. In a transfer-learning experiment conducted by a researcher, a number of
ImageNet-pretrained CNN classifiers, selected from Table 6.1, are trained on five different folds
drawn from the same corpus. Their outputs are fused together, producing a composite
machine. Ensembles of these convolutional neural network architectures have been
extensively studied and evaluated in various ensembling approaches [4, 9]. Is it likely
that the composite machine will produce a prediction with higher accuracy than that
of any individual classifier? Explain why.

CNN Model       Classes   Image Size   Top-1 accuracy (%)
ResNet152       1000      224          78.428
DPN98           1000      224          79.224
SeNet154        1000      224          81.304
SeResneXT101    1000      224          80.236
DenseNet161     1000      224          77.960
InceptionV4     1000      299          80.062


TABLE 6.1: ImageNet-pretrained CNNs. Ensembles of these CNN architectures have been
extensively studied and evaluated in various ensembling approaches.


2. True or False: In a classification task, the result of ensembling is always superior.


3. True or False: In an ensemble, we want differently trained models to converge to
different local minima.


PRB-148 @ CH.PRB- 6.14. 
In committee machines, mark all the combiners that do not make direct use of the input: 


(i) A mixture of experts 
(ii) Bagging 
(iii) Ensemble averaging 


(iv) Boosting 


PRB-149 CH.PRB- 6.15.

True or False: Considering a binary classification problem (y = 0 or y = 1), ensemble
averaging, wherein the outputs of individual models are linearly combined to produce a fused
output, is a form of a static committee machine.


FIGURE 6.6: A typical binary classification problem.


PRB-150 @ CH.PRB- 6.16. 

True or false: When using a single model, the risk of overfitting the data increases when 
the number of adjustable parameters is large compared to cardinality (i.e., size of the set) of 
the training corpus. 


PRB-151 O CH.PRB- 6.17. 
True or false: If we have a committee of K trained models and the errors are uncorrelated, 
then by averaging them the average error of a model is reduced by a factor of K. 


6.2.4 Ensemble Learning 


PRB-152 CH.PRB- 6.18.
1. Define ensemble learning in the context of machine learning.
2. Provide examples of ensemble methods in classical machine-learning.
3. True or false: Ensemble methods usually have stronger generalization ability.
4. Complete the sentence: Bagging is a [variance/bias] reduction scheme while boosting
reduces [variance/bias].


6.2.5 Snapshot Ensembling


PRB-153 CH.PRB- 6.19.

Your colleague, a well-known expert in ensembling methods, writes the following pseudo
code in Python shown in Fig. 6.7 for the training of a neural network. This runs inside a
standard loop in each training and validation step.


import torchvision.models as models

models = ['resnext']
for m in models:
    train ...
    compute VAL loss ...
    amend LR ...
    if (val_acc > 90.0):
        saveModel()


FIGURE 6.7: PyTorch code snippet for an ensemble 


1. What type of ensembling can be used with this approach? Explain in detail. 


2. What is the main advantage of snapshot ensembling? What are the disadvantages, if 
any? 


PRB-154 @ CH.PRB- 6.20. 
Assume further that your colleague amends the code as shown in Fig. 6.8. 


1| import torchvision.models as models
2| import random
3| import numpy as np
4| import torch
5| models = ['resnext']
6| for m in models:
7|     train ...
8|     compute loss ...
9|     amend LR ...
10|    manualSeed = draw a new random number
11|    random.seed(manualSeed)
12|    np.random.seed(manualSeed)
13|    torch.manual_seed(manualSeed)
14|    if (val_acc > 90.0):
15|        saveModel()


FIGURE 6.8: PyTorch code snippet for an ensemble 


Explain in detail what would be the possible effects of adding lines 10-13. 


6.2.6  Multi-model Ensembling 


PRB-155 € CH.PRB- 6.21. 


1. Assume your colleague, a veteran in DL and an expert in ensembling methods, writes 
the following pseudo code, shown in Fig. 6.9, for the training of several neural networks. 
This code snippet is executed inside a standard loop in each and every training/validation 
epoch. 




Chapter 6 DEEP LEARNING: NN ENSEMBLES 


import torchvision.models as models
...
models = ['resnext','vgg','dense']
for m in models:
    train ...
    compute loss/acc ...
    if (val_acc > 90.0):
        saveModel()


FIGURE 6.9: PyTorch code snippet for an ensemble 


What type of ensembling is being utilized in this approach? Explain in detail. 


2. Name one method by which NN models may be combined to yield a single prediction. 


6.2.7 Learning-rate Schedules in Ensembling 


PRB-156 @ CH.PRB- 6.22. 


1. Referring to Fig. (6.10), which depicts a specific learning rate schedule, describe the 
basic notion behind its mechanism. 


FIGURE 6.10: A learning rate schedule. 


2. Explain how cyclic learning rates [10] can be effective for the training of convolutional 
neural networks such as the ones in the code snippet of Fig. 6.10. 


3. Explain how a cyclic cosine annealing schedule as proposed by Loshchilov [10] and 
[13] is used to converge to multiple local minima. 


6.3 Solutions 
6.3.1 Bagging, Boosting and Stacking 


SOL-135 Y CH.SOL- 6.1. 
All the presented options are correct. a 


SOL-136 y CH.SOL- 6.2. 
The correct choice would be stacking. In cases where the given corpus is small, we would 
most likely prefer training our models on the full data-set. a 


SOL-137 Y CH.SOL- 6.3. 
A random forest is an ensemble of decision trees which utilizes bagging. 


SOL-138 CH.SOL- 6.4. 
The presented algorithm is classic bagging. 


SOL-139 Y CH.SOL- 6.5. 

The approach which is depicted is the first phase of stacking. In stacking, we first (phase 
0) predict using several base learners and then use a generalizer (phase 1) that learns on top 
of the base learners' predictions. 


SOL-140 UY CH.SOL- 6.6. 


1. Sampling with replacement 
2. Sampling without replacement 


3. This is most likely a result of bagging, since in boosting we would have expected 
misclassified observations to repeatedly appear in subsequent samples. 


6.3.2 Approaches for Combining Predictors 


SOL-141 UY CH.SOL- 6.7. 
An algebraic weighted average of the class probabilities. 


SOL-142 @ CH.SOL- 6.8. 
This is true, [3] provides a mathematical proof. m 


SOL-143 YY CH.SOL- 6.9. 
This is true. For an extension, see for instance [8]. 


SOL-144 @ CH.SOL- 6.10. 
This is true. In a bagging approach, we first randomly draw (with replacement) K examples, 
where K is the size of the original training corpus, therefore leading to an imbalanced 
representation of the instances. 
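The imbalance described above can be made concrete with a small simulation. The following is a minimal sketch, assuming a hypothetical corpus of 10,000 instance ids and an arbitrary fixed seed; the helper name `bootstrap_sample` is illustrative and not taken from the text. When K items are drawn with replacement from K items, roughly 1 - 1/e (about 63%) of the distinct instances are expected to appear, while the rest are left out and some appear more than once.

```python
import random


def bootstrap_sample(corpus, rng):
    """Draw len(corpus) items uniformly at random, with replacement."""
    return rng.choices(corpus, k=len(corpus))


rng = random.Random(0)            # fixed seed, chosen arbitrarily for repeatability
corpus = list(range(10_000))      # hypothetical corpus of K = 10,000 instance ids
sample = bootstrap_sample(corpus, rng)

# Fraction of the original instances that made it into the bootstrap sample.
unique_fraction = len(set(sample)) / len(corpus)
print(f"fraction of distinct instances in the sample: {unique_fraction:.3f}")
```

The instances that never appear in a given bootstrap sample are often called out-of-bag instances and can serve as a held-out set for that base learner.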


6.3.3 Monolithic and Heterogeneous Ensembling 


SOL-145 Y CH.SOL- 6.11. 


1. True. Due to their lack of diversity, an ensemble of monolithic architectures tends to 
perform worse than a heterogeneous ensemble. 


2. True. This has been consistently demonstrated in [11, 5]. 


3. True. In [6] there is a discussion about using both the whole corpus and a subset, much 
like in bagging. 


4. True. The total error decreases with the addition of predictors to the ensemble. 


SOL-146 Y CH.SOL- 6.12. 
Yes, they do. a 


SOL-147 Y CH.SOL- 6.13. 


1. Yes, it is very likely, especially if their errors are independent. 


2. True. It may be proven that an ensemble of models performs at least as well as each of the 
ensemble members it consists of. 


3. True Different local minima add to the diversification of the models. 


SOL-148 Y CH.SOL- 6.14. 
Boosting is the only one that does not. m 
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SOL-149 UY CH.SOL- 6.15. 
False. By definition, static committee machines use only the output of the single predictors. 


SOL-150 Y CH.SOL- 6.16. 
True 


SOL-151 UY CH.SOL- 6.17. 
False. Though this may be theoretically true, in practice the errors are rarely uncorrelated 
and therefore the actual error cannot be reduced by a factor of K. 
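The theoretical claim can be checked numerically. The sketch below is an illustration under idealized assumptions that the solution itself says rarely hold in practice: each of the K model errors is drawn independently from a zero-mean, unit-variance Gaussian. The function name `committee_mse` and the values of K and the trial count are hypothetical choices.

```python
import random


def committee_mse(k, trials, rng):
    """Return (average squared error of a single model,
    squared error of the K-model average), both averaged over trials."""
    single, committee = 0.0, 0.0
    for _ in range(trials):
        # Independent, zero-mean errors: the idealized uncorrelated case.
        errors = [rng.gauss(0.0, 1.0) for _ in range(k)]
        single += sum(e * e for e in errors) / k
        committee += (sum(errors) / k) ** 2
    return single / trials, committee / trials


rng = random.Random(0)
K = 10
single_mse, averaged_mse = committee_mse(K, 20_000, rng)
ratio = single_mse / averaged_mse
print(f"single: {single_mse:.3f}, committee: {averaged_mse:.3f}, ratio: {ratio:.1f}")
```

Under these assumptions the ratio comes out close to K. If the simulated errors shared a common component (i.e., were correlated), the ratio would shrink, which is the practical situation the solution refers to.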


6.3.4 Ensemble Learning 


SOL-152 Y CH.SOL- 6.18. 


1. Ensemble learning is a machine learning approach which displays noticeable benefits 
in many applications; one notable example is the widespread use of ensembles 
in Kaggle competitions. In an ensemble, several individual models (for instance ResNet18 
and VGG16) which were trained on the same corpus work in tandem, and during 
inference their predictions are fused by a pre-defined strategy to yield a single prediction. 


2. In classical machine learning, ensemble methods usually refer to bagging, boosting and 
the linear combination of regression or classification models. 


3. True. The stronger generalization ability stems from the voting power of diverse models 
which are joined together. 


4. Bagging is a variance reduction scheme, while boosting reduces bias. 


6.3.5 Snapshot Ensembling 


SOL-153 Y CH.SOL- 6.19. 


1. Since only a single model is being utilized, this type of ensembling is known as snapshot 
ensembling. Using this approach, during the training of a neural network and 
in each epoch, a snapshot, i.e. the weights of a trained instance of a model (a PTH 
file in PyTorch nomenclature), is persisted to permanent storage whenever a certain 
performance metric, such as accuracy or loss, surpasses a threshold. Hence the name 
“snapshot”: the weights of the neural network are captured at specific instants in 
time. After several such epochs, the top-5 performing snapshots which converged to 
local minima [4] are combined as part of an ensemble to yield a single prediction. 


2. Advantages: during a single training cycle, many model instances may be collected. 
Disadvantages: an inherent lack of diversity, by virtue of the fact that the same model is 
being repeatedly used. 


SOL-154 UY CH.SOL- 6.20. 
Changing the random seed at each iteration/epoch helps in introducing variation, which 
may contribute to diversifying the trained neural network models. 


6.3.6 Multi-model Ensembling 


SOL-155 Y CH.SOL- 6.21. 


1. Multi-model ensembling. 


2. Both averaging and majority voting. 
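The two fusion strategies named above can be sketched in a few lines. This is a minimal illustration, assuming each model emits a vector of class probabilities for a single input; the function names and the three probability vectors are hypothetical and were chosen so that the two strategies disagree.

```python
from collections import Counter


def average_fusion(prob_vectors):
    """Average the class probabilities over all models, then take the arg-max."""
    n_models = len(prob_vectors)
    n_classes = len(prob_vectors[0])
    mean = [sum(p[c] for p in prob_vectors) / n_models for c in range(n_classes)]
    return max(range(n_classes), key=mean.__getitem__)


def majority_vote(prob_vectors):
    """Let every model vote for its own arg-max class; return the most common vote."""
    votes = [max(range(len(p)), key=p.__getitem__) for p in prob_vectors]
    return Counter(votes).most_common(1)[0][0]


# Hypothetical outputs of three models for one input and three classes.
probs = [
    [0.90, 0.10, 0.00],   # model 1: very confident in class 0
    [0.40, 0.60, 0.00],   # model 2: mildly prefers class 1
    [0.45, 0.55, 0.00],   # model 3: mildly prefers class 1
]
print(average_fusion(probs), majority_vote(probs))
```

Averaging takes each model's confidence into account, so one very confident model can outweigh two hesitant ones; majority voting only counts the winning class of each model.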


6.3.7 Learning-rate Schedules in Ensembling 


SOL-156 CH.SOL- 6.22. 


1. Capturing the best model of each training cycle allows one to obtain multiple models settled 
on various local optima from cycle to cycle at the cost of training a single model. 


2. The approach is based on the non-convex nature of neural networks and the ability to 
converge to and escape from local minima using a specific schedule to adjust the learning 
rate during training. 


3. Instead of monotonically decreasing the learning rate, this method lets the learning rate 
cyclically vary between reasonable boundary values. 
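A cyclic cosine schedule of this kind can be written down directly. The sketch below assumes the commonly used form lr(t) = (lr_max / 2) * (cos(pi * mod(t - 1, L) / L) + 1), with L = ceil(T / M) the length of one cycle, T the total number of iterations and M the number of cycles. The function name and the concrete values (300 iterations, 3 cycles, lr_max = 0.1) are illustrative assumptions, not values taken from the text.

```python
import math


def cyclic_cosine_lr(t, total_iters, n_cycles, lr_max):
    """Learning rate at iteration t (1-based) under cyclic cosine annealing."""
    cycle_len = math.ceil(total_iters / n_cycles)
    position = (t - 1) % cycle_len          # position inside the current cycle
    return 0.5 * lr_max * (math.cos(math.pi * position / cycle_len) + 1.0)


# Hypothetical run: 300 iterations split into 3 cycles of 100 iterations each.
schedule = [cyclic_cosine_lr(t, 300, 3, 0.1) for t in range(1, 301)]
print(schedule[0], schedule[99], schedule[100])
```

Within each cycle the rate decays from lr_max towards zero, letting the network settle into a local minimum (where a snapshot can be saved); the restart to lr_max at the start of the next cycle perturbs the weights enough to leave that minimum.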
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What goes up must come down. 


— Isaac Newton 
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7.1 Introduction 


The extraction of an n-dimensional feature vector (FV) or an embedding from 
one (or more) layers of a pre-trained CNN is termed feature extraction (FE). 
Usually, FE works by first removing the last fully connected (FC) layer from 
a CNN and then treating the remaining layers of the CNN as a fixed FE. As 
exemplified in Fig. (7.1) and Fig. (7.2), applying this method to the ResNet34 architecture, 
the resulting FV consists of 512 floating point values. Likewise, applying the 
same logic to the ResNet152 architecture, the resulting FV has 2048 floating point elements. 


A fixed k-element FV. 
0.7766 | 0.4455 | 0.8342 | 0.6324 | ... | k = 512 
Actual values of a normalized k-element FV. 


FIGURE 7.1: A one-dimensional 512-element embedding for a single image from the Res- 
Net34 architecture. While any neural network can be used for FE, depicted is 
the ResNet CNN architecture with 34 layers. 


import torchvision.models as models
...
res_model = models.resnet34(pretrained=True)


FIGURE 7.2: PyTorch declaration for a pre-trained ResNet34 CNN (simplified). 


The premise behind FE is that CNNs which were originally trained on the Im- 
ageNet Large Scale Visual Recognition Competition [7], can be adapted and used (for 
instance in a classification task) on a completely different (target) domain without any 
additional training of the CNN layers. The power of a CNN to do so lies in its ability 
to generalize well beyond the original data-set it was trained on, therefore FE on a 
new target data-set involves no training and requires only inference. 


7.2 Problems 


7.2.1 CNN as Fixed Feature Extractor 
Before attempting the problems in this chapter you are highly encouraged to read the 
following papers [1, 3, 7]. In many DL job interviews, you will be presented with a 
paper you have never seen before and subsequently be asked questions about it; so 
reading these references would be an excellent simulation of this real-life task. 
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PRB-157 @ CH.PRB- 7.1. 

True or False: While AlexNet [4] used 11 x 11 sized filters, the main novelty presented 
in the VGG [8] architecture was utilizing filters with a much smaller spatial extent, sized 
3 x 3. 


PRB-158 @ CH.PRB- 7.2. 
True or False: Unlike CNN architectures such as AlexNet or VGG, ResNet does not 
have any hidden FC layers. 


PRB-159 @ CH.PRB- 7.3. 

Assuming the VGG-Net has 138,357,544 floating point parameters, what is the physical 
size in Mega-Bytes (MB) required for persisting a trained instance of VGG-Net on 
permanent storage? 


PRB-160 @ CH.PRB- 7.4. 

True or False: Most attempts at researching image representation using FE, focused 
solely on reusing the activations obtained from layers close to the output of the CNN, and 
more specifically the fully-connected layers. 


PRB-161 O CH.PRB- 7.5. 

True or False: FE in the context of deep learning is particularly useful when the target 
problem does not include enough labeled data to successfully train a CNN that generalizes 
well. 


PRB-162 @ CH.PRB- 7.6. 
Why is a CNN trained on the ImageNet dataset [7] a good candidate for a source prob- 
lem? 


PRB-163 € CH.PRB- 7.7. 


Complete the missing parts regarding the VGG19 CNN architecture: 

1. The VGG19 CNN consists of [...] layers. 
2. It consists of [...] convolutional and 3 [...] layers. 
3. The input image size is [...]. 
4. The number of input channels is [...]. 
5. Every image has its mean RGB value [subtracted / added]. 
6. Each convolutional layer has a [small/large] kernel sized [...]. 
7. The number of pixels for padding and stride is [...]. 
8. There are 5 [...] layers having a kernel size of [...] and a stride of [...] pixels. 
9. For non-linearity a [rectified linear unit (ReLU [5]) / sigmoid] is used. 
10. The [...] FC layers are part of the linear classifier. 
11. The first two FC layers consist of [...] features. 
12. The last FC layer has only [...] features. 
13. The last FC layer is terminated by a [...] activation layer. 
14. Dropout [is / is not] being used between the FC layers. 


PRB-164 @ CH.PRB- 7.8. 

The following question discusses the method of fixed feature extraction from layers of the 
VGG19 architecture [8] for the classification of pancreatic cancer. It depicts FE principles 
which are applicable, with minor modifications, to other CNNs as well. Therefore, if you happen 
to encounter a similar question in a job interview, you are likely to be able to cope with 
it by utilizing the same logic. In Fig. (7.3) three different classes of pancreatic cancer are 
displayed: A, B and C, curated from a dataset of 4K Whole Slide Images (WSI) labeled by 
a board certified pathologist. Your task is to use FE to correctly classify the images in the 
dataset. 
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FIGURE 7.3: A dataset of 4K histopathology WSI from three severity classes: A, B and C. 


Table (7.1) presents an incomplete listing of the VGG19 architecture [8]. As depicted, 
for each layer the number of filters (i.e., neurons with a unique set of parameters), 
learnable parameters (weights, biases), and FV size are presented. 


Layer name #Filters #Parameters # Features 


conv4_3 512 2.3M 512 
fc6 4,096 103M 4,096 
fc7 4,096 17M 4,096 
output 1,000 4M - 
Total 13,416 138M 12,416 


TABLE 7.1: Incomplete listing of the VGG19 architecture 


1. Describe how the VGG19 CNN may be used as fixed FE for a classification task. In 
your answer be as detailed as possible regarding the stages of FE and the method used 
for classification. 


2. Referring to Table (7.1), suggest three different ways in which features can be extracted 
from a trained VGG19 CNN model. In each case, state the extracted feature layer 
name and the size of the resulting FV. 


3. After successfully extracting the features for the 4K images from the dataset, how can 
you now classify the images into their respective categories? 
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PRB-165 O CH.PRB- 7.9. 

Still referring to Table (7.1), a data scientist suggests using the output layer of the 
VGG19 CNN as a fixed FE. What is the main advantage of using this layer over using, 
for instance, the fc7 layer? (Hint: think about an ensemble of feature extractors) 


PRB-166 @ CH.PRB- 7.10. 
Still referring to Table (7.1) and also to the code snippet in Fig. (7.4), which represents a 
new CNN derived from the VGG19 CNN: 


1| import torchvision.models as models
2| ...
3| class VGG19FE(torch.nn.Module):
4|     def __init__(self):
5|         super(VGG19FE, self).__init__()
6|         original_model = models.vgg19(pretrained=[???])
7|         self.real_name = (((type(original_model).__name__)))
8|         self.real_name = "vgg19"
9|
10|        self.features = [???]
11|        self.classifier = torch.nn.Sequential([???])
12|        self.num_feats = [???]
13|
14|    def forward(self, x):
15|        f = self.features(x)
16|        f = f.view(f.size(0), -1)
17|        f = [???]
18|        print(f.data.size())
19|        return f


FIGURE 7.4: PyTorch code snippet for extracting the fc7 layer from a pre-trained VGG19 
CNN model. 


1. Complete line 6; what should be the value of (pretrained)? 
2. Complete line 10; what should be the value of (self.features)? 
3. Complete line 12; what should be the value of (self.num_feats)? 
4. Complete line 17; what should be the value of f? 


PRB-167 € CH.PRB- 7.11. 

We are still referring to Table (7.1) and using the skeleton code provided in Fig. (7.5) 
to derive a new CNN named ResNetBottom from the ResNet34 CNN, to extract a 
512-dimensional FV for a given input image. Complete the code as follows: 


1. The value of (self.features) in line 7. 
2. The method in line 11. 


1| import torchvision.models as models
2| res_model = models.resnet34(pretrained=True)
3|
4| class ResNetBottom(torch.nn.Module):
5|     def __init__(self, original_model):
6|         super(ResNetBottom, self).__init__()
7|         self.features = [???]
8|     def forward(self, x):
9|         x = [???]
10|        x = x.view(x.size(0), -1)
11|        return x


FIGURE 7.5: PyTorch code skeleton for extracting a 512-dimensional FV from a pre-trained 
ResNet34 CNN model. 
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PRB-168 @ CH.PRB- 7.12. 

Still referring to Table (9.3), the PyTorch based pseudo code snippet in Fig. (7.6) returns 
the 512-dimensional FV from the modified ResNet34 CNN, given a 3-channel RGB image 
as an input. 


1| import torchvision.models as models
2| from torchvision import transforms
3| ...
4|
5| test_trans = transforms.Compose([
6|     transforms.Resize(imgnet_size),
7|     transforms.ToTensor(),
8|     transforms.Normalize([0.485, 0.456, 0.406],
9|                          [0.229, 0.224, 0.225])])
10|
11| def ResNet34FE(image, model):
12|     f = None
13|     image = test_trans(image)
14|     image = Variable(image, requires_grad=False).cuda()
15|     image = image.cuda()
16|     f = model(image)
17|     f = f.view(f.size(1), -1)
18|     print("Size : {}".format(f.shape))
19|     f = f.view(f.size(1))
20|     print("Size : {}".format(f.shape))
21|     f = f.cpu().detach().numpy()[0]
22|     print("Size : {}".format(f.shape))
23|     return f


FIGURE 7.6: PyTorch code skeleton for extracting a 512-dimensional FV from a pre-trained 
ResNet34 CNN model. 


Answer the following questions regarding the code in Fig. (7.6): 

1. What is the purpose of (test_trans) in line 5? 
2. Why is the parameter (requires_grad) set to False in line 14? 
3. What is the purpose of (f.cpu()) in line 21? 
4. What is the purpose of (detach()) in line 21? 
5. What is the purpose of (numpy()[0]) in line 21? 


7.2.2 Fine-tuning CNNs 


PRB-169 @ CH.PRB- 7.13. 
Define the term fine-tuning (FT) of an ImageNet pre-trained CNN. 


PRB-170 O CH.PRB- 7.14. 
Describe three different methods by which one can fine-tune an ImageNet pre-trained 
CNN. 


PRB-171 @ CH.PRB- 7.15. 

Melanoma is a lethal form of malignant skin cancer, frequently misdiagnosed as a benign 
skin lesion or even left completely undiagnosed. 

In the United States alone, melanoma accounts for an estimated 6,750 deaths per annum 
[6]. With a 5-year survival rate of 98%, early diagnosis and treatment is now more likely 
and possibly the most suitable means for melanoma related death reduction. Dermoscopy 
images, shown in Fig. (7.7) are widely used in the detection and diagnosis of skin lesions. 
Dermatologists, relying on personal experience, are involved in a laborious task of manually 
searching dermoscopy images for lesions. 

Therefore, there is a very real need for automated analysis tools, providing assistance to 
clinicians screening for skin metastases. In this question, you are tasked with addressing 
some of the fundamental issues DL researchers face when building deep learning pipelines. 
As suggested in [3], you are going to use an ImageNet pre-trained CNN to resolve a classification 
task. 

FIGURE 7.7: Skin lesion categories. An exemplary visualization of melanoma. 


1. Given that the skin lesions fall into seven distinct categories, and you are training us- 
ing cross-entropy loss, how should the classes be represented so that a typical PyTorch 
training loop will successfully converge? 


2. Suggest several data augmentation techniques to augment the data. 


3. Write a code snippet in PyTorch to adapt the CNN so that it can predict 7 classes 
instead of the original source size of 1000. 


4. In order to fine tune our CNN, the (original) output layer with 1000 classes was 
removed and the CNN was adjusted so that the (new) classification layer comprised 
seven softmax neurons emitting posterior probabilities of class membership for each 
lesion type. 


7.2.3 Neural style transfer, NST 


Before attempting the problems in the section, you are strongly recommended to read 
the paper: “A Neural Algorithm of Artistic Style” [2]. 


PRB-172 O CH.PRB- 7.16. 
Briefly describe how neural style transfer (NST) [2] works. 


PRB-173 O CH.PRB- 7.17. 
Complete the sentence: When using the VGG-19 CNN [8] for neural-style transfer, 
three different images are involved. Namely, they are: [...], [...] and [...]. 
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PRB-174 @ CH.PRB- 7.18. 
Refer to Fig. 7.8 and answer the following questions: 


Style image 


FIGURE 7.8: Artistic style transfer using the style of Francis Picabia's Udnie painting. 


1. Which loss is being utilized during the training process? 


2. Briefly describe the use of activations in the training process. 


PRB-175 O CH.PRB- 7.19. 
Still referring to Fig. 7.8: 


1. How are the activations utilized in comparing the content of the content image to the 
content of the combined image? 


2. How are the activations utilized in comparing the style of the style image to the 
style of the combined image? 


PRB-176 @ CH.PRB- 7.20. 
Still referring to Fig. 7.8. For a new style transfer algorithm, a data scientist extracts a 
feature vector from an image using a pre-trained ResNet34 CNN (Fig. 7.9). 


import torchvision.models as models
...
res_model = models.resnet34(pretrained=True)


FIGURE 7.9: PyTorch declaration for a pre-trained ResNet34 CNN. 


He then defines the cosine similarity between two vectors: 


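$u = \{u_1, u_2, \ldots, u_n\}$ and $v = \{v_1, v_2, \ldots, v_n\}$ 

as: 

$$\mathrm{sim}(u, v) = \frac{u \cdot v}{\|u\| \, \|v\|} = \frac{\sum_{i=1}^{n} u_i v_i}{\sqrt{\sum_{i=1}^{n} u_i^2} \, \sqrt{\sum_{i=1}^{n} v_i^2}}$$ 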


Thus, the cosine similarity between two vectors measures the cosine of the angle between 
the vectors irrespective of their magnitude. It is calculated as the dot product of two numeric 
vectors, and is normalized by the product of the length of the vectors. 
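As a quick illustration of the definition, here is a minimal pure-Python sketch; the function name `cosine_similarity` and the example vectors are illustrative. It assumes that neither vector is all zeros, since the norm in the denominator would otherwise vanish.

```python
import math


def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their Euclidean norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


parallel = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])   # same direction
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])           # right angle
print(parallel, orthogonal)
```

Scaling either vector leaves the result unchanged, which is what "irrespective of their magnitude" means in practice.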

Answer the following questions: 


1. Define the term Gram matrix. 


2. Explain in detail how vector similarity is utilised in the calculation of the Gram mat- 
rix during the training of NST. 


7.3 Solutions 
7.3.1 CNN as Fixed Feature Extractor 
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SOL-157 & CH.SOL- 7.1. 

True. The increased depth in VGG-Net was made possible by using smaller filters without 
substantially increasing the number of learnable parameters. An unwanted side effect 
of the usage of smaller filters, however, is the increase in the number of filters per layer. 


SOL-158 UY CH.SOL- 7.2. 

True. The ResNet architecture terminates with a global average pooling layer followed 
by a K-way FC layer with a softmax activation function, where K is the number of classes 
(ImageNet has 1000 classes). Therefore, the ResNet has no hidden FC layers. a 


SOL-159 @ CH.SOL- 7.3. Assuming each parameter is stored as a 32-bit float, and noting that 
1 bit = 0.000000125 MB: 


138,357,544 x 32 = 4,427,441,408 bits = 553.430176 MB. (7.1) 
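The arithmetic can be reproduced as follows. This is a small sketch assuming 32-bit (4-byte) parameters and decimal megabytes (1 MB = 1,000,000 bytes), which is the convention implied by the conversion factor above; the function name `model_size_mb` is illustrative.

```python
def model_size_mb(n_params, bits_per_param=32):
    """Storage needed for n_params parameters, in decimal megabytes."""
    total_bits = n_params * bits_per_param
    total_bytes = total_bits // 8
    return total_bytes / 1_000_000


vgg_params = 138_357_544
size_fp32 = model_size_mb(vgg_params)                      # 32-bit floats
size_fp16 = model_size_mb(vgg_params, bits_per_param=16)   # half precision
print(size_fp32, size_fp16)
```

The on-disk size of a real checkpoint is usually slightly larger, since serialization formats also store parameter names and other metadata.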


SOL-160 Ud CH.SOL- 7.4. 
True. There are dozens of published papers supporting this claim. You are encouraged to 
search for them on arXiv or Google Scholar. 


SOL-161 UY CH.SOL- 7.5. 

True. One of the major hurdles of training a medical AI system is the lack of annotated 
data. Therefore, extensive research is conducted to explore ways for FE and transfer learning, 
e.g., in the application of ImageNet trained CNNs, to target datasets in which labeled data is 
scarce. 


SOL-162 UY CH.SOL- 7.6. 
There are two main reasons why this is possible: 


1. The huge number of images inside the ImageNet dataset ensures a CNN model that generalizes 
to additional domains, like the histopathology domain, which is substantially 
different from the original domain the model was trained on (e.g., cats and dogs). 


2. A massive array of disparate visual patterns is produced by an ImageNet trained CNN, 
since the dataset consists of 1,000 different classes. 


SOL-163 y CH.SOL- 7.7. 
Complete the missing parts regarding the VGG19 CNN architecture: 


1. The VGG19 CNN consists of (19) layers. 
2. It consists of (16) convolutional and 3 (FC) layers. 
3. The input image size is (224 x 224), the default size most ImageNet trained CNNs work on. 
4. The number of input channels is (3). 
5. Every image has its mean RGB value (subtracted). (Why?) 
6. Each convolutional layer has a (small) kernel sized (3 x 3). (Why?) 
7. The number of pixels for padding and stride is the same and equals (1). 
8. There are 5 (max-pooling) layers having a kernel size of (2 x 2) and a stride of (2) pixels. 
9. For non-linearity a (rectified linear unit (ReLU [5])) is used. 
10. The (3) FC layers are part of the linear classifier. 
11. The first two FC layers consist of (4096) features. 
12. The last FC layer has only (1000) features. 
13. The last FC layer is terminated by a (softmax) activation layer. 
14. Dropout (is) being used between the FC layers. 


SOL-164 CH.SOL- 7.8. 


1. One or more layers of the VGG19 CNN are selected for extraction and a new CNN 
is designed on top of it. Thus, during inference our target layers are extracted and 
not the original softmax layer. Subsequently, we iterate and run inference over all 
the images in our pancreatic cancer data-set, extract the features, and persist them to 
permanent storage such as a solid-state drive (SSD) device. Ultimately, each image has 
a corresponding FV. 


2. Regarding the VGG19 CNN, there are numerous ways of extracting and combining 
features from different layers. Of course, these different layers, e.g., the FC, conv4_3, 
and fc7 layers, may be combined together to form a larger feature vector. To determine 
which method works best, you will have to experiment on your data-set; there is no way 
of determining the optimal combination of layers a priori. Here are several examples: 

(a) Accessing the last FC layer, resulting in a 1000-D FV. The output is the score for 
each of the 1000 classes of the ImageNet data-set. 
(b) Removing the last FC layer leaves the fc7 layer, resulting in a 4096-D FV. 
(c) Directly accessing the conv4_3 layer results in a 512-D FV. 


3. Once the FVs are extracted, we can train any linear classifier such as an SVM or 
softmax classifier on the FV data-set, and not on the original images. 


SOL-165 y CH.SOL- 7.9. 
One benefit of using the output (last FC) layer is that other ImageNet CNNs can be used in tandem 
with the VGG19 to create an ensemble, since they all produce the same 1000-D sized FV. 


SOL-166 CH.SOL- 7.10. The full code is presented in Fig. (7.10). 


import torch
import torchvision.models as models

class VGG19FE(torch.nn.Module):
    def __init__(self):
        super(VGG19FE, self).__init__()
        original_model = models.vgg19(pretrained=True)
        self.real_name = (((type(original_model).__name__)))
        self.real_name = "vgg19"

        # Keep the convolutional part of the original model as is.
        self.features = original_model.features
        # Drop the last FC layer so that the classifier ends at fc7.
        self.classifier = torch.nn.Sequential(
            *list(original_model.classifier.children())[:-1])
        self.num_feats = 4096

    def forward(self, x):
        f = self.features(x)
        f = f.view(f.size(0), -1)
        f = self.classifier(f)
        print(f.data.size())
        return f


FIGURE 7.10: PyTorch code snippet for extracting the fc7 layer from a pre-trained VGG19 
CNN model. 


1. The value of the parameter (pretrained) should be True in order to instruct PyTorch to 
load ImageNet trained weights. 

2. The value of (self.features) should be (original_model.features). This is because we would like to 
retain the layers of the original model (original_model). 

3. The value of (self.num_feats) should be (4096). (Why?) 

4. The value of (f) should be (self.classifier(f)), since our newly created CNN has to be 
invoked to generate the FV. 
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SOL-167 Y CH.SOL- 7.11. 


1. Line number 7 in Fig. (7.11) takes care of extracting the correct 512-D FV. 


2. Line number 11 in Fig. (7.11) extracts the correct 512-D FV by creating a sequential 
module on top of the existing features. 


1| import torchvision.models as models
2| res_model = models.resnet34(pretrained=True)
3| class ResNetBottom(torch.nn.Module):
4|     def __init__(self, original_model):
5|         super(ResNetBottom, self).__init__()
6|         self.features = [???]
7|     def forward(self, x):
8|         x = [???]
9|         x = x.view(x.size(0), -1)
10|        return x


FIGURE 7.11: PyTorch code snippet for extracting a 512-dimensional FV from a pre-trained 
ResNet34 CNN model. 


SOL-168 Uy CH.SOL- 7.12. 


1. Transforms are incorporated into deep learning pipelines in order to apply one or more 
operations on images which are represented as tensors. Different transforms are usually 
utilized during training and inference. For instance, during training we can use a 
transform to augment our data-set, while during inference our transform may be limited 
only to normalizing an image. PyTorch allows the use of transforms either during 
training or inference. The purpose of (test_trans) in line 5 is to resize the image, convert 
it to a tensor and normalize the data. 


2. The parameter (requires_grad) is set to False in line 14 since during inference the 
computation of gradients is unnecessary. 


3. The purpose of (f.cpu()) in line 21 is to move a tensor that was allocated on the GPU 
to the CPU. This may be required if we want to apply a CPU-based method from the 
Python numpy package on a tensor that does not live on the CPU. 


4. (detach()) in line 21 returns a newly created tensor without affecting the current tensor. 
It also detaches the output from the current computational graph, hence no gradient is 
backpropagated for this specific variable. 


5. The purpose of (numpy()[0]) in line 21 is to convert the variable (a tensor) to a numpy 
array and also to retrieve the first element of that array. 


7.3.2. Fine-tuning CNNs 


SOL-169 Ud CH.SOL- 7.13. 

The term fine-tuning (FT) of an ImageNet pre-trained CNN refers to the method by which 
one or more of the weights of the CNN are re-trained on a new target data-set, which may or 
may not have similarities with the ImageNet data-set. 


SOL-170 YY CH.SOL- 7.14. The three methods are as follows: 


1. Replacing and re-training (only the classifier) (usually the FC layer) of the ImageNet 
pre-trained CNN, on a target data-set. 

2. FT (all of the layers) of the ImageNet pre-trained CNN, on a target data-set. 

3. FT (part of the layers) of the ImageNet pre-trained CNN, on a target data-set. 


SOL-171 CH.SOL- 7.15. 
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1. The categories have to be represented numerically. One such option is presented in Code 
(7.1). 


{'MEL': 0, 'NV': 1, 'BCC': 2, 'AKIEC': 3, 'BKL': 4, 'DF': 5, 'VASC': 6}


CODE 7.1: The seven categories of skin lesions. 
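To make the numeric representation concrete, the sketch below maps seven category names to integer class indices and evaluates the cross-entropy loss for one example in pure Python. The abbreviations used here (MEL, NV, BCC, AKIEC, BKL, DF, VASC) and their ordering are assumptions for illustration, as are the helper names. In PyTorch, `torch.nn.CrossEntropyLoss` is typically given raw logits together with integer class indices of this kind.

```python
import math

# Assumed category names and ordering; any fixed, consistent ordering works.
LESION_CLASSES = ["MEL", "NV", "BCC", "AKIEC", "BKL", "DF", "VASC"]
CLASS_TO_INDEX = {name: i for i, name in enumerate(LESION_CLASSES)}


def cross_entropy(logits, target_index):
    """Negative log of the softmax probability assigned to the target class."""
    m = max(logits)                                   # subtract max for stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target_index]


target = CLASS_TO_INDEX["BCC"]
uniform_loss = cross_entropy([0.0] * 7, target)       # untrained, uniform output
confident_loss = cross_entropy([0.0, 0.0, 10.0, 0.0, 0.0, 0.0, 0.0], target)
print(target, uniform_loss, confident_loss)
```

A network that outputs identical logits for all seven classes incurs a loss of ln(7), roughly 1.95, which is a useful sanity check for the first training iterations.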


2. Several possible augmentations are presented in Code (7.2). Usually, one finds the best 
possible augmentation for a target data-set by trial and error. However, methods 
such as AutoAugment may render the manual selection of augmentations obsolete. 


self.transforms = []
if rotate:
    self.transforms.append(RandomRotate())
if flip:
    self.transforms.append(RandomFlip())
if brightness != 0:
    self.transforms.append(PILBrightness())
if contrast != 0:
    self.transforms.append(PILContrast())
if colorbalance != 0:
    self.transforms.append(PILColorBalance())
if sharpness != 0:
    self.transforms.append(PILSharpness())


CODE 7.2: Pseudo code for augmentations. 


3. In contrast to the ResNet CNN, which ends with an FC layer, the ImageNet pre-trained 
DPN CNN family, in this case pretrainedmodels.dpn107, terminates with a Conv2d 
layer and hence must be adapted accordingly if one wishes to change the number of 
classes from the 1000 (ImageNet) classes to our skin lesion classification problem (7 
classes). Line 7 in Code (7.3) demonstrates this idiom. 


1| import pretrainedmodels, torch
2| class Dpn107Finetune(torch.nn.Module):
3|     def __init__(self, num_classes: int, net_kwards):
4|         super().__init__()
5|         self.net = pretrainedmodels.dpn107(**net_kwards)
6|         self.net.__name__ = str(self.net)
7|         self.net.classifier = torch.nn.Conv2d(2688,
8|             num_classes, kernel_size=1)
9|         print(self.net)


CODE 7.3: Change between 1000 classes to 7 classes for the ImageNet pre-trained DPN 
CNN family. 


7.3.3 Neural style transfer 


SOL-172 CH.SOL- 7.16.
The images are: a content image, a style image and lastly a combined image.


SOL-173 CH.SOL- 7.17.
The algorithm presented in the paper suggests how to combine the content of a first image
with the style of a second image to generate a third, stylized image using CNNs.


SOL-174 CH.SOL- 7.18.
The answers are as follows:
1. The training pipeline uses a combined loss which consists of a weighted average of the
style loss and the content loss.


2. Different CNN layers at different levels are utilized to capture both fine-grained stylistic
details as well as larger stylistic features.


SOL-175 CH.SOL- 7.19.


1. The content loss is the mean square error (MSE) calculated as the difference between
the CNN activations of the last convolutional layer of both the content image and the
combined image.


2. The style loss amalgamates the losses of several layers together. For each layer, the Gram
matrix (see 7.2) for the activations at that layer is obtained for both the style and the
combined images. Then, just like in the content loss, the MSE of the Gram matrices is
calculated.
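
The two losses described above can be sketched in a few lines of plain Python. This is only a minimal illustration under stated assumptions, not the implementation used in the paper: activations are given as small lists of flattened feature maps, the helper names (`mse`, `flatten`, `gram_matrix`, `content_loss`, `style_loss`) and the toy numbers are hypothetical, and no per-layer weighting or normalisation is applied.

```python
# Minimal sketch of the content and style losses (hypothetical helper names).
# Each "layer activation" is a list of flattened feature maps (lists of floats).

def mse(a, b):
    """Mean squared error between two equally sized flat lists."""
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)

def flatten(rows):
    """Flatten a list of lists into a single flat list."""
    return [v for row in rows for v in row]

def gram_matrix(feature_maps):
    """Gram matrix G[i][j] = dot product of feature maps i and j."""
    return [[sum(p * q for p, q in zip(u, v)) for v in feature_maps]
            for u in feature_maps]

def content_loss(content_act, combined_act):
    """MSE between the activations of the content and the combined image."""
    return mse(flatten(content_act), flatten(combined_act))

def style_loss(style_layers, combined_layers):
    """Sum over layers of the MSE between the Gram matrices."""
    return sum(
        mse(flatten(gram_matrix(s)), flatten(gram_matrix(c)))
        for s, c in zip(style_layers, combined_layers)
    )

# Toy activations (assumed values): one layer, two feature maps of three values.
style = [[[1.0, 0.0, 2.0], [0.0, 1.0, 1.0]]]
combined = [[[1.0, 0.0, 2.0], [0.0, 1.0, 0.0]]]
print(gram_matrix(style[0]))
print(style_loss(style, combined))
print(content_loss(style[0], combined[0]))
```

In a real pipeline the total loss would be a weighted combination of the two terms, with the weights treated as hyper-parameters.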


SOL-176 CH.SOL- 7.20.
For each feature map, a feature vector is extracted. The Gram matrix captures the correlation
between these feature vectors, which is then used in the loss function. Given a
list of feature vectors extracted from the images, $u_1, \ldots, u_k \in \mathbb{R}^n$, the Gram matrix is defined
as:

\[
G = \begin{pmatrix} u_1 \cdot u_1 & \cdots & u_1 \cdot u_k \\ \vdots & \ddots & \vdots \\ u_k \cdot u_1 & \cdots & u_k \cdot u_k \end{pmatrix} \tag{7.2}
\]
The Gram matrix m


It is the weight, not numbers of experiments that is to be regarded.


— Isaac Newton.


8.1 Introduction


It was Alex Krizhevsky who first demonstrated that a convolutional neural
network (CNN) can be effectively trained on the ImageNet large scale visual
recognition challenge. A CNN automatically provides some degree of translation
invariance and assumes that we wish to learn filters, in a data-driven fashion, as
a means to extract features describing the inputs. CNNs are applied to numerous
computer vision, imaging, and computer graphics tasks as in [24], [23], [15], [5].
Furthermore, they have become extremely popular, and novel architectures and algorithms
are continually popping up overnight.


8.2 Problems 
8.2.1 Cross Validation 


On the significance of cross validation and stratification in particular, refer to “A study 
of cross-validation and bootstrap for accuracy estimation and model selection” [17]. 


CV approaches 


PRB-177 CH.PRB- 8.1.
Fig (8.1) depicts two different cross-validation approaches. Name them.


[Figure panels: a TRAIN | VAL split; folds numbered 1, 2, 3, 4, 5, 6, 7, ..., n]


FIGURE 8.1: Two CV approaches


PRB-178 CH.PRB- 8.2.


1. What is the purpose of the following Python code snippet (8.2)?


from sklearn.model_selection import StratifiedKFold

# Five stratified folds; samples are shuffled with a fixed seed before splitting.
# The labels y are supplied later, when calling skf.split(X, y).
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=989)


FIGURE 8.2: Stratified K-fold


2. Explain the benefits of using the K-fold cross validation approach.
3. Explain the benefits of using the Stratified K-fold cross validation approach.
4. State the difference between K-fold cross validation and stratified cross validation.
5. Explain in your own words what is meant by “We adopted a 5-fold cross-validation
approach to estimate the testing error of the model”.
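
As a rough illustration of the mechanics behind a K-fold estimate, the sketch below splits the sample indices into K disjoint folds, holds each fold out once, and averages the K validation errors. It is a plain (non-stratified) variant written from scratch. The function names, the toy data and the stand-in "model" (predicting the mean of the training targets) are assumptions made only for this example.

```python
# Minimal K-fold cross-validation sketch (hypothetical data and "model").
import random

def k_fold_indices(n_samples, k, seed=0):
    """Shuffle the sample indices and split them into k disjoint folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def cross_validate(xs, ys, k=5):
    """Return the per-fold validation MSE values and their arithmetic mean."""
    folds = k_fold_indices(len(xs), k)
    fold_mse = []
    for val_idx in folds:
        val_set = set(val_idx)
        train_idx = [i for i in range(len(xs)) if i not in val_set]
        # Stand-in for fitting a fresh model on the training folds:
        # predict the mean of the training targets.
        prediction = sum(ys[i] for i in train_idx) / len(train_idx)
        mse = sum((ys[i] - prediction) ** 2 for i in val_idx) / len(val_idx)
        fold_mse.append(mse)
    return fold_mse, sum(fold_mse) / k

xs = list(range(20))
ys = [2.0 * x + 1.0 for x in xs]
fold_mse, cv_score = cross_validate(xs, ys, k=5)
print(len(fold_mse), cv_score)
```

A stratified variant would additionally keep the class proportions of each fold close to those of the whole data-set, which this sketch does not attempt.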


K-Fold CV 


PRB-179 CH.PRB- 8.3.


True or False: In a K-fold CV approach, the testing set is completely excluded from the
process and only the training and validation sets are involved in this approach.


PRB-180 CH.PRB- 8.4.
True or False: In a K-fold CV approach, the final test error is:

\[
CV_{(k)} = \frac{1}{k} \sum_{i=1}^{k} MSE_i \tag{8.1}
\]


PRB-181 CH.PRB- 8.5.
Mark all the correct choices regarding a cross-validation approach:


(i) A 5-fold cross-validation approach results in 5 different model instances being fitted.


(ii) A 5-fold cross-validation approach results in 1 model instance being fitted over and
over again 5 times.


(iii) A 5-fold cross-validation approach results in 5 different model instances being fitted
over and over again 5 times.


(iv) Uses K different data-folds.


PRB-182 CH.PRB- 8.6.
Mark all the correct choices regarding the approach that should be taken to compute the
performance of K-fold cross-validation:


(i) We compute the cross-validation performance as the arithmetic mean over the K
performance estimates from the validation sets.


(ii) We compute the cross-validation performance as the best one over the K performance
estimates from the validation sets.


Stratification 


PRB-183 CH.PRB- 8.7.

A data-scientist who is interested in classifying cross sections of histopathology image
slices (8.3) decides to adopt a cross-validation approach he once read about in a book. Name
the approach from the following options:


[Figure panel label: K-fold CV]


FIGURE 8.3: A specific CV approach


(i) 3-fold CV
(ii) 3-fold CV with stratification
(iii) A (repeated) 3-fold CV


LOOCV 


PRB-184 CH.PRB- 8.8.


1. True or false: The leave-one-out cross-validation (LOOCV) approach is a sub-case of
K-fold cross-validation wherein K equals N, the sample size.


2. True or false: It is always possible to find an optimal value n, such that K = n, in K-fold
cross-validation.


8.2.2 Convolution and correlation 


The convolution operator 


PRB-185 CH.PRB- 8.9.
Equation 8.2 is commonly used in image processing:

\[
(f * g)(t) = \int_{-\infty}^{\infty} f(\tau)\, g(t - \tau)\, d\tau \tag{8.2}
\]

1. What does equation 8.2 represent?


2. What does g(t) represent?


PRB-186 CH.PRB- 8.10.
A data-scientist assumes that: 


i A convolution operation is both linear and shift invariant. 


ii A convolution operation is just like correlation, except that we flip over the filter before 
applying the correlation operator. 


iii The convolution operation reaches a maximum, only in cases where the filter is mostly 
similar to a specific section of the input signal. 


Is he right in assuming so? Explain in detail the meaning of these statements. 


The correlation operator 


PRB-187 CH.PRB- 8.11.
Mark the correct choice(s):


1. The cross-correlation operator is used to find the location where two different signals
are most similar.


2. The autocorrelation operator is used to find when a signal is similar to a delayed
version of itself.
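
To make the sliding behaviour of the two operators concrete, here is a small from-scratch sketch in plain Python. The names `cross_correlate` and `convolve`, the "valid" (no padding) convention, and the toy signal are assumptions of this example rather than anything prescribed by the text. Note also that a raw, unnormalised correlation score grows with signal amplitude, so the peak is only a rough indicator of similarity.

```python
def cross_correlate(signal, template):
    """Valid-mode cross-correlation: slide the template without flipping it."""
    n, m = len(signal), len(template)
    return [sum(signal[i + j] * template[j] for j in range(m))
            for i in range(n - m + 1)]

def convolve(signal, kernel):
    """Valid-mode convolution: correlation with the kernel flipped."""
    return cross_correlate(signal, kernel[::-1])

signal = [0, 0, 1, 3, 1, 0, 0, 0]
template = [1, 3, 1]
scores = cross_correlate(signal, template)
best_shift = scores.index(max(scores))  # shift with the highest score
print(scores, best_shift)
```

Because the template used here is symmetric, flipping it changes nothing and the two operators agree; with an asymmetric kernel they generally differ.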


PRB-188 CH.PRB- 8.12.
A data-scientist provides you with a formula for a discrete 2D convolution operation
(8.3):

\[
f(x, y) * h(x, y) = \sum_{m=0}^{M-1} \sum_{n=0}^{N-1} f(m, n)\, h(x - m, y - n) \tag{8.3}
\]

Using only (8.3), write the equivalent 2D correlation operation.


Padding and stride 
Recommended reading: “A guide to convolution arithmetic for deep learning” by Vincent
Dumoulin and Francesco Visin (2016) [22].


PRB-189 CH.PRB- 8.13.

When designing a convolutional neural network layer, one must also define how the filter
or kernel slides through the input signal. This is controlled by what is known as the stride
and padding parameters or modes. The two most commonly used padding approaches in
convolutions are the (VALID) and the (SAME) modes. Given a stride of 1:


1. Define SAME


2. Define VALID
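
The effect of stride and padding on the output size along one axis can be explored numerically with the sketch below. It assumes the commonly used relation floor((n + 2p - f) / s) + 1 and the TensorFlow-style convention in which SAME pads the input so that the output length is ceil(n / s). The function names are hypothetical.

```python
import math

def conv_output_size(n, f, stride=1, padding=0):
    """Output length along one axis: floor((n + 2p - f) / s) + 1."""
    return (n + 2 * padding - f) // stride + 1

def valid_output_size(n, f, stride=1):
    """VALID mode: no padding is added to the input."""
    return conv_output_size(n, f, stride, padding=0)

def same_output_size(n, stride=1):
    """SAME mode (assumed convention): output length is ceil(n / stride)."""
    return math.ceil(n / stride)

print(valid_output_size(6, 3))
print(same_output_size(6))
print(conv_output_size(6, 3, padding=1))
```

For an odd filter size f and a stride of 1, padding each side with (f - 1) / 2 zeros reproduces the SAME behaviour, which is what the last call illustrates.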


PRB-190 CH.PRB- 8.14.
True or False: A valid convolution is a type of convolution operation that does not use 
any padding on the input. 


PRB-191 CH.PRB- 8.15.
You are provided with a K × K input signal and a θ × θ filter. The signal is subjected to
the valid padding mode convolution. What are the resulting dimensions?

\[
arr = \begin{bmatrix} 0 & \cdots & 0 \\ 0 & \cdots & 0 \\ 0 & \cdots & 0 \end{bmatrix} \tag{8.4}
\]


PRB-192 CH.PRB- 8.16.
As depicted in (8.4), a filter is applied to a 3 × 3 input signal. Identify the correct choice
given a stride of 1 and SAME padding mode.


FIGURE 8.4: A padding approach


PRB-193 CH.PRB- 8.17.
As depicted in (8.5), a filter is applied to a 3 × 3 input signal. Mark the correct choices
given a stride of 1.


(i) A represents a VALID convolution and B represents a SAME convolution
(ii) A represents a SAME convolution and B represents a VALID convolution
(iii) Both A and B represent a VALID convolution
(iv) Both A and B represent a SAME convolution


FIGURE 8.5: A padding approach


PRB-194 CH.PRB- 8.18.
In this question we discuss the two most commonly used padding approaches in
convolutions: (VALID) and (SAME). Fig. 8.6 presents Python code for generating an input signal
arr001 and a convolution kernel filter001. The input signal, arr001, is first initialized to
all zeros as follows:

\[
arr001 = \begin{bmatrix}
0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0
\end{bmatrix} \tag{8.5}
\]

1. Without actually executing the code, determine what would be the resulting shape of 
the convolve2d() operation. 


2. Manually compute the result of convolving the input signal with the provided filter. 


3. Elaborate why the size of the resulting convolutions is smaller than the size of the
input signal.


import numpy
import scipy.signal

# 6 x 6 input signal, initialized to zeros
arr001 = numpy.zeros((6, 6), dtype=float)
print(arr001)
# ...

# 3 x 3 kernel: first column 2.0, last column -2.0
filter001 = numpy.zeros((3, 3), dtype=float)
filter001[:, 0] = 2.0
filter001[:, 2] = -2.0

output = scipy.signal.convolve2d(arr001, filter001, mode='valid')


FIGURE 8.6: Convolution and correlation in python 


Kernels and filters 


PRB-195 CH.PRB- 8.19.
Equation 8.6 is the discrete equivalent of equation 8.2, which is frequently used in image
processing:

\[
(y * k)[i, j] = \sum_{n} \sum_{m} y[i - n, j - m]\, k[n, m] \tag{8.6}
\]
1. Given the following discrete kernel in the X direction, what would be the equivalent Y
direction?

\[
k = \frac{1}{2} \begin{bmatrix} -1 & 1 \\ -1 & 1 \end{bmatrix} \tag{8.7}
\]

2. Identify the discrete convolution kernel presented in (8.7).

\[
\begin{bmatrix} -1 & 0 & 1 \\ -2 & 0 & 2 \\ -1 & 0 & 1 \end{bmatrix}
\]


FIGURE 8.7: A 3 by 3 convolution kernel


PRB-196 CH.PRB- 8.20.
Given an image of size w × h, and a kernel with width K, how many multiplications and
additions are required to convolve the image?


Convolution and correlation in python 


PRB-197 CH.PRB- 8.21.


Fig. 8.8 presents two built-in Python functions for the convolution and correlation
operators.


import numpy as np
np.convolve(A, B, "full")   # for convolution
np.correlate(A, B, "full")  # for cross correlation


FIGURE 8.8: Convolution and correlation in python


1. Implement the convolution operation from scratch in Python. Compare it with the
built-in numpy equivalent.


2. Implement the correlation operation using the implementation of the convolution
operation. Compare it with the built-in numpy equivalent.


Separable convolutions 


PRB-198 CH.PRB- 8.22.
The Gaussian distribution in 1D and 2D is shown in Equations 8.8 and 8.9.

\[
G(x) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{x^2}{2\sigma^2}} \tag{8.8}
\]

\[
G(x, y) = \frac{1}{2\pi\sigma^2} e^{-\frac{x^2 + y^2}{2\sigma^2}} \tag{8.9}
\]

The Gaussian filter is an operator that is used to blur images and remove detail and
noise while acting like a low-pass filter. This is similar to the way a mean filter works, but
the Gaussian filter uses a different kernel. This kernel is represented with a Gaussian bell
shaped bump.

Answer the following questions:


1. Can 8.8 be used directly on a 2D image?
2. Can 8.9 be used directly on a 2D image?


3. Is the Gaussian filter separable? If so, what are the advantages of separable filters?
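
The relation between the 1D and 2D forms can be checked numerically. The sketch below samples both Gaussians on a small integer grid, normalises them so that the weights sum to one, and compares the 2D kernel with the outer product of the 1D kernel with itself. The function names, the radius of 2 and sigma = 1.0 are arbitrary choices made for this illustration.

```python
import math

def gaussian_1d(radius, sigma):
    """Sampled, normalised 1D Gaussian kernel of length 2 * radius + 1."""
    values = [math.exp(-(x * x) / (2.0 * sigma ** 2))
              for x in range(-radius, radius + 1)]
    total = sum(values)
    return [v / total for v in values]

def gaussian_2d(radius, sigma):
    """Sampled, normalised 2D Gaussian kernel."""
    values = [[math.exp(-(x * x + y * y) / (2.0 * sigma ** 2))
               for x in range(-radius, radius + 1)]
              for y in range(-radius, radius + 1)]
    total = sum(sum(row) for row in values)
    return [[v / total for v in row] for row in values]

g1 = gaussian_1d(radius=2, sigma=1.0)
g2 = gaussian_2d(radius=2, sigma=1.0)
# Outer product of the 1D kernel with itself
outer = [[a * b for b in g1] for a in g1]
max_diff = max(abs(outer[i][j] - g2[i][j])
               for i in range(5) for j in range(5))
print(max_diff)
```

When a kernel factorises in this way, a k × k filtering pass can be replaced by a horizontal and a vertical 1D pass, reducing the work per pixel from roughly k² to 2k multiplications.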


8.2.3 Similarity measures 


Image, text similarity 


PRB-199 CH.PRB- 8.23.
A data scientist extracts a feature vector from an image using a pre-trained ResNet34
CNN (8.9).


import torchvision.models as models

res_model = models.resnet34(pretrained=True)


FIGURE 8.9: PyTorch declaration for a pre-trained ResNet34 CNN (simplified).


He then applies the following algorithm, entitled xxx, on the image (8.10).


#include <cmath>
#include <vector>

void xxx(std::vector<float>& arr) {
    float mod = 0.0;
    // Accumulate the sum of the squared elements
    for (float i : arr) {
        mod += i * i;
    }
    float mag = std::sqrt(mod);
    // Divide every element by mag, in place
    for (float& i : arr) {
        i /= mag;
    }
}


FIGURE 8.10: An unknown algorithm in C++11


Which results in this vector (8.11):


0.7766 | 0.4455 | 0.8342 | 0.6324 | ... | k = 512


Values after applying xxx to a k-element FV.


FIGURE 8.11: A one-dimensional 512-element embedding for a single image from the
ResNet34 architecture.


Name the algorithm that he used and explain in detail why he used it.


PRB-200 CH.PRB- 8.24.
Further to the above, the scientist then applies the following algorithm:


Algorithm 3: Algo 1


Data: Two vectors v1 and v2 are provided
Apply algorithm xxx on the two vectors
Run algorithm 2


Algorithm 4: Algo 2


#include <cstddef>
#include <vector>

float algo2(const std::vector<float>& v1, const std::vector<float>& v2) {
    double mul = 0;
    // Accumulate the element-wise products of the two vectors
    for (size_t i = 0; i < v1.size(); ++i) {
        mul += v1[i] * v2[i];
    }
    // Negative results are clipped to zero
    if (mul < 0) {
        return 0;
    }
    return mul;
}


FIGURE 8.12: An unknown algorithm


1. Name the algorithm algo2 that he used and explain in detail what he used it for.
2. Write the mathematical formula behind it.
3. What are the minimum and maximum values it can return?
4. An alternative similarity measure between two vectors is:

\[
sim_{euc}(v_1, v_2) = -\lVert v_1 - v_2 \rVert \tag{8.10}
\]

Name the measure.


Jaccard similarity


PRB-201 CH.PRB- 8.25.


1. What is the formula for the Jaccard similarity [12] of two sets?
2. Explain the formula in plain words.


3. Find the Jaccard similarity given the sets depicted in (8.13).


FIGURE 8.13: Jaccard similarity.


4. Compute the Jaccard similarity of each pair of the following sets:


i {12, 14, 16, 18}.
ii {11, 12, 13, 14, 15}.
iii {11, 16, 17}.
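
Python's built-in `set` type makes this kind of pairwise computation easy to check. The sketch below is one possible way to do it, using the usual definition of the Jaccard similarity as the size of the intersection divided by the size of the union. The function name `jaccard`, the dictionary keys and the convention of returning 1.0 for two empty sets are assumptions of this example.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard similarity |A intersect B| / |A union B| of two sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 1.0  # convention chosen here for two empty sets
    return len(a & b) / len(union)

sets = {
    "i": {12, 14, 16, 18},
    "ii": {11, 12, 13, 14, 15},
    "iii": {11, 16, 17},
}
# Print the similarity of every unordered pair of sets
for (name_a, a), (name_b, b) in combinations(sets.items(), 2):
    print(name_a, name_b, jaccard(a, b))
```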


The Kullback-Leibler Distance 


PRB-202 CH.PRB- 8.26.

In this problem, you have to actually read 4 different papers, so you will probably not
encounter such a question during an interview. However, reading academic papers is an
excellent skill to master for becoming a DL researcher.

Read the following papers which discuss aspects of the Kullback-Leibler divergence:


i Bennet [2]
ii Ziv [29]
iii Bigi [3]
iv Jensen [1]

The Kullback-Leibler divergence, which was discussed thoroughly in chap 4, is a measure
of how different two probability distributions are. As noted, the KL divergence of the
probability distributions P, Q on a set X is defined as shown in Equation 8.11.

\[
D_{KL}(P \| Q) = \sum_{x \in X} P(x) \log \frac{P(x)}{Q(x)} \tag{8.11}
\]


Note however that since KL divergence is a non-symmetric information theoretical measure
of distance of P from Q, it is not strictly a distance metric. During the past years,
various KL based distance measures (rather than divergence based) have been introduced in
the literature generalizing this measure.

Name each of the following KL based distances:

\[
D_{KLD1}(P \| Q) = D_{KL}(P \| Q) + D_{KL}(Q \| P) \tag{8.12}
\]

\[
D_{KLD2}(P \| Q) = \sum_{x \in X} \left( P(x) - Q(x) \right) \log \frac{P(x)}{Q(x)} \tag{8.13}
\]

\[
D_{KLD3}(P \| Q) = \frac{1}{2} \left[ D_{KL}\left(P \,\Big\|\, \frac{P + Q}{2}\right) + D_{KL}\left(Q \,\Big\|\, \frac{P + Q}{2}\right) \right] \tag{8.14}
\]

\[
D_{KLD4}(P \| Q) = \max\left( D_{KL}(P \| Q),\, D_{KL}(Q \| P) \right) \tag{8.15}
\]
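
The lack of symmetry mentioned above is easy to observe numerically. The following sketch evaluates the KL divergence of Eq. (8.11) in both directions for two small discrete distributions, and also the sum of the two directions as in Eq. (8.12). The distributions `p` and `q` are made-up values, the natural logarithm is used, and the code assumes that Q(x) > 0 wherever P(x) > 0.

```python
import math

def kl_divergence(p, q):
    """D_KL(P || Q) for two discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def symmetric_sum(p, q):
    """Eq. (8.12): the sum of the two directed divergences."""
    return kl_divergence(p, q) + kl_divergence(q, p)

# Hypothetical distributions over three outcomes
p = [0.5, 0.4, 0.1]
q = [0.3, 0.3, 0.4]
print(kl_divergence(p, q), kl_divergence(q, p))   # differ from each other
print(symmetric_sum(p, q), symmetric_sum(q, p))   # identical by construction
```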
MinHash 


Read the paper entitled Detecting near-duplicates for web crawling [12] and answer the 
following questions. 


PRB-203 CH.PRB- 8.27.
What is the goal of hashing? Draw a simple HashMap of keys and values. Explain what
a collision is and the notion of buckets. Explain the goal of MinHash.


PRB-204 CH.PRB- 8.28.
What is Locality Sensitive Hashing or LSH?


PRB-205 CH.PRB- 8.29.


Complete the sentence: The main goal of LSH is to [...] the probability of a collision for
similar items in a corpus.


8.2.4 Perceptrons 


The Single Layer Perceptron 


PRB-206 CH.PRB- 8.30.


1. Complete the sentence: In a single-layer feed-forward NN, there are [...] input(s)
and [...] output layer(s) and no [...] connections at all.


PRB-207 CH.PRB- 8.31.


In its simplest form, a perceptron (8.16) accepts only a binary input and emits a binary
output. The output can be evaluated as follows:

\[
output = \begin{cases} 0, & \sum_j w_j x_j + b \le 0 \\ 1, & \sum_j w_j x_j + b > 0 \end{cases} \tag{8.16}
\]

Where weights are denoted by $w_j$ and biases are denoted by $b$. Answer the following
questions:


1. True or False: If such a perceptron is trained using a labelled corpus, for each
participating neuron the values $w_j$ and $b$ are learned automatically.


2. True or False: If we instead use a new perceptron (sigmoidal) defined as follows:

\[
\sigma(wx + b) \tag{8.17}
\]

where $\sigma$ is the sigmoid function:

\[
\sigma(z) = \frac{1}{1 + e^{-z}} \tag{8.18}
\]

Then the new perceptron can process inputs ranging between 0 and 1 and emit output
ranging between 0 and 1.


3. Write the cost function associated with the sigmoidal neuron.


4. If we want to train the perceptron in order to obtain the best possible weights and
biases, which mathematical equation do we have to solve?


5. Complete the sentence: To solve this mathematical equation, we have to apply [...]


6. What does the following equation stand for?

\[
\nabla C = \frac{1}{n} \sum_{i} \nabla C_i \tag{8.19}
\]

Where:


7. Complete the sentence: Due to the time-consuming nature of computing gradients for
each entry in the training corpus, modern DL libraries utilize a technique that gauges
the gradient by first randomly sampling a subset from the training corpus, and then
averaging only this subset in every epoch. This approach is known as [...]. The actual
number of randomly chosen samples in each epoch is termed [...]. The gradient itself
is obtained by an algorithm known as [...].


The Multi Layer Perceptron 


PRB-208 CH.PRB- 8.32.

The following questions refer to the MLP depicted in (8.14). The inputs to the MLP in
(8.14) are $x_1 = 0.9$ and $x_2 = 0.7$ respectively, and the weights $w_1 = -0.3$ and $w_2 = 0.15$
respectively. There is a single hidden node, $H_1$. The bias term, $B_1$, equals 0.001.


[Figure labels: Inputs, Sum]


FIGURE 8.14: Several nodes in a MLP.


1. We examine the mechanism of a single hidden node, $H_1$. The inputs and weights go
through a linear transformation. What is the value of the output (out1) observed at
the sum node?


2. What is the value resulting from the application of the sum operator?


3. Verify the correctness of your results using PyTorch.


Activation functions in perceptrons 


PRB-209 CH.PRB- 8.33.
The following questions refer to the MLP depicted in (8.15).


1. Further to the above, the ReLU non-linear activation function $g(z) = \max\{0, z\}$ is
applied (8.15) to the output of the linear transformation. What is the value of the
output (out2) now?


[Figure labels: Inputs, Hidden, Sum, Activation; weight -0.3; bias 0.001]


FIGURE 8.15: Several nodes in a MLP.


2. Confirm your manual calculation using PyTorch tensors.


Back-propagation in perceptrons 


PRB-210 CH.PRB- 8.34.
Your co-worker, a postgraduate student at M.I.T, suggests using the following activation
functions in a MLP. Which ones can never be back-propagated and why?


i
\[ f(x) = |x| \tag{8.21} \]
ii
\[ f(x) = x \tag{8.22} \]
iii
fla) = | d dd 7 (8.23) 
iv 
y :>0 
f(x)=4 == «<0 (8.24) 
0 x=0 


PRB-211 CH.PRB- 8.35.
You are provided with the following MLP as depicted in 8.16.


FIGURE 8.16: A basic MLP


The ReLU non-linear activation function $g(z) = \max\{0, z\}$ is applied to the hidden
layers $H_1 \ldots H_3$ and the bias term equals 0.001.

At a certain point in time it has the following values (8.17), all of which belong to the
type torch.FloatTensor:


import torch

x = torch.tensor([0.9, 0.7])  # Input
w = torch.tensor([
    [-0.3, 0.15],
    [0.32, -0.91],
    [0.37, 0.47],
])  # Weights
B = torch.tensor([0.002])  # Bias


FIGURE 8.17: MLP operations. 


1. Using Python, calculate the output of the MLP at the hidden layers $H_1 \ldots H_3$.


2. Further to the above, you discover that at a certain point in time the weights
between the hidden layers and the output layers $y_i$ have the following values:


wi= torch.tensor([ 
LOPES, —0 A ODIN 
LOAN Oo POY 
) 


What is the value observed at the output nodes $y_1 \ldots y_2$?


3. Assume now that a Softmax activation is applied to the output. What are the resulting
values?


4. Assume now that a cross-entropy loss is applied to the output of the Softmax.

\[
L = -\sum_i \hat{y}_i \log(y_i) \tag{8.25}
\]

What are the resulting values?


The theory of perceptrons 


PRB-212 CH.PRB- 8.36. If someone is quoted saying:


MLP networks are universal function approximators.


What does he mean?


PRB-213 CH.PRB- 8.37.
True or False: the output of a perceptron is 0 or 1.


PRB-214 CH.PRB- 8.38.
True or False: A multi-layer perceptron falls under the category of supervised machine
learning.


PRB-215 CH.PRB- 8.39.
True or False: The accuracy of a perceptron is calculated as the number of correctly
classified samples divided by the total number of incorrectly classified samples.


Learning logical gates


PRB-216 CH.PRB- 8.40.
The following questions refer to the SLP depicted in (8.18). The weights in the SLP are
$w_1 = 1$ and $w_2 = 1$ respectively. There is a single hidden node, $H_1$. The bias term, $B_1$,
equals $-2.5$.


FIGURE 8.18: A single layer perceptron. 


1. Assuming the inputs to the SLP in (8.18) are
i $x_1 = 0.0$ and $x_2 = 0.0$
ii $x_1 = 0.0$ and $x_2 = 1.0$
iii $x_1 = 1.0$ and $x_2 = 0.0$
iv $x_1 = 1.0$ and $x_2 = 1.0$
What is the value resulting from the application of the sum operator?


2. Repeat the above, assuming now that the bias term $B_1$ was amended and equals $-0.25$.


3. Define the perceptron learning rule.


4. What was the most crucial difference between Rosenblatt's original algorithm and
Hinton's fundamental papers of 1986:
“Learning representations by back-propagating errors” [22]
and 2012:
“ImageNet Classification with Deep Convolutional Neural Networks” [18]?


5. The AND logic gate [7] is defined by the following table (8.19):

x1 | x2 | y
1  | 1  | 1
1  | 0  | 0
0  | 1  | 0
0  | 0  | 0


FIGURE 8.19: Logical AND gate


Can a perceptron with only two inputs and a single output function as an AND logic
gate? If so, find the weights and the threshold and demonstrate the correctness of your
answer using a truth table.
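
A candidate set of weights can be checked against a truth table with a few lines of Python. The sketch below evaluates a two-input threshold unit on all four binary input pairs. The function names and the parameter values passed in the example are hypothetical, chosen only to illustrate the procedure, and other values can be tried in the same way.

```python
def perceptron(x1, x2, w1, w2, bias):
    """Threshold unit: 1 if the weighted sum plus bias is positive, else 0."""
    return 1 if w1 * x1 + w2 * x2 + bias > 0 else 0

def truth_table(w1, w2, bias):
    """Evaluate the unit on all four binary input pairs."""
    return {(x1, x2): perceptron(x1, x2, w1, w2, bias)
            for x1 in (0, 1) for x2 in (0, 1)}

# Hypothetical parameters, chosen only for illustration
table = truth_table(w1=1.0, w2=1.0, bias=-1.5)
for inputs, output in table.items():
    print(inputs, output)
```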


8.2.5 Activation functions (rectification) 


We concentrate only on the most commonly used activation functions, those which 
the reader is more likely to encounter or use during his daily work. 


Sigmoid 


PRB-217 CH.PRB- 8.41.

The Sigmoid $s_c(x) = \frac{1}{1 + e^{-cx}}$, also commonly known as the logistic function (Fig. 8.20),
is widely used in binary classification and as a neuron activation function in artificial neural
networks. Typically, during the training of an ANN, a Sigmoid layer applies the Sigmoid
function to elements in the forward pass, while in the backward pass the chain rule is
utilized as part of the backpropagation algorithm. In 8.20 the constant c was selected
arbitrarily as 2 and 5 respectively.


FIGURE 8.20: Examples of two sigmoid functions and an approximation.


Digital hardware implementations of the sigmoid function do exist, but they are expensive
to compute, and therefore several approximation methods were introduced by the research
community. The method by [10] uses the following formula to approximate the exponential
function:

\[
e^{x} \approx Ex(x) = 2^{1.44x} \tag{8.26}
\]

Based on this formulation, one can calculate the sigmoid function as:

\[
Sigmoid(x) \approx \frac{1}{1 + 2^{-1.44x}} \approx \frac{1}{1 + 2^{-1.5x}} \tag{8.27}
\]
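
The quality of the base-2 approximation in Eq. (8.27) can be inspected with a short plain-Python sketch. The function names are hypothetical, and the sample points are an arbitrary grid between 0 and 1 picked for illustration.

```python
import math

def sigmoid(x):
    """Exact logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_base2(x, factor=1.5):
    """Approximation of Eq. (8.27): replace e**(-x) by 2**(-factor * x)."""
    return 1.0 / (1.0 + 2.0 ** (-factor * x))

xs = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.99]
for x in xs:
    # exact value, approximation with 1.44, approximation with 1.5
    print(x, sigmoid(x), sigmoid_base2(x, 1.44), sigmoid_base2(x))
max_err = max(abs(sigmoid(x) - sigmoid_base2(x)) for x in xs)
print(max_err)
```

The factor 1.44 is close to log2(e), which is why the first approximation is nearly exact. Rounding it to 1.5 trades a small error for an exponent that is cheaper to realise in hardware.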


1. Code snippet 8.21 provides a pure C++ based (i.e. not using Autograd) implementation
of the forward pass for the Sigmoid function. Implement the backward pass that
directly computes the analytical gradients in C++ using Libtorch [19] style tensors.


#include <torch/script.h>
#include <vector>

// Forward pass: element-wise logistic sigmoid
torch::Tensor sigmoid001(const torch::Tensor& x) {
    torch::Tensor sig = 1.0 / (1.0 + torch::exp((-x)));
    return sig;
}


FIGURE 8.21: Forward pass for the Sigmoid function using Libtorch


2. Code snippet 8.22 provides a skeleton for printing the values of the sigmoid and its
derivative for a range of values contained in the vector v. Complete the body of the loop
so that the values are printed.


#include <torch/script.h>
#include <vector>

int main() {
    std::vector<float> v{0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.99};
    for (auto it = v.begin(); it != v.end(); ++it) {
        torch::Tensor t0 = torch::tensor((*it));
        // To be completed: print the sigmoid of t0 and its derivative
    }
}


FIGURE 8.22: Evaluation of the sigmoid and its derivative using Libtorch


3. Manually derive the derivative of eq. 8.27, i.e.:

\[
\frac{d}{dx} \left[ \frac{1}{1 + 2^{-1.5x}} \right] \tag{8.28}
\]

4. Implement both the forward pass and the backward pass for the Sigmoid function
approximation (eq. 8.27), directly computing the analytical gradients in C++ using Libtorch [19].


5. Print the values of the Sigmoid function and the Sigmoid function approximation (eq.
8.27) for the following vector:

\[
v = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.99] \tag{8.29}
\]


Tanh 


PRB-218 CH.PRB- 8.42.


The hyperbolic tangent nonlinearity, or the tanh function (Fig. 8.23), is a widely used
neuron activation function in artificial neural networks:

\[
f_{tanh}(x) = \frac{\sinh(x)}{\cosh(x)} = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}} \tag{8.30}
\]


FIGURE 8.23: Examples of two tanh functions.


1. Manually derive the derivative of the tanh function.


2. Use this numpy array as an input [[0.37, 0.192, 0.571]] and evaluate the result using
pure Python.


3. Use the PyTorch based torch.autograd.Function class to write a custom Function
that implements the forward and backward passes for the tanh function in Python.


4. Name the class TanhFunction, and using the gradcheck method from torch.autograd,
verify that your analytical gradients match the numerical gradients computed by gradcheck.
Remember that the function must be invoked through .apply(x) so that it can
be tracked by Autograd.


PRB-219 CH.PRB- 8.43.


The code snippet in 8.24 makes use of the tanh function.


import torch
from torch import nn

nn001 = nn.Sequential(
    nn.Linear(200, 512),
    nn.Tanh(),
    nn.Linear(512, 512),
    nn.Tanh(),
    nn.Linear(512, 10),
    nn.LogSoftmax(dim=1)
)


FIGURE 8.24: A simple NN based on tanh in PyTorch.


1. What type of a neural network does nn001 in 8.24 represent?


2. How many hidden layers does the network nn001 have?


PRB-220 CH.PRB- 8.44.


Your friend, a veteran of the DL community, claims that MLPs based on the tanh activation
function have a symmetry around 0 and consequently cannot be saturated. Saturation, so
he claims, is a phenomenon typical of the top hidden layers in sigmoid based MLPs. Is he
right or wrong?


PRB-221 CH.PRB- 8.45.
If we initialize the weights of a tanh based NN, which of the following approaches will
lead to the vanishing gradients problem?


i Using the normal distribution, with parameter initialization method as suggested by 
Kaiming [14]. 


ii Using the uniform distribution, with parameter initialization method as suggested by 
Xavier Glorot [9]. 


iii Initialize all parameters to a constant zero value. 


PRB-222 CH.PRB- 8.46.
Your friend, who is experimenting with the tanh activation function, designed a small
CNN with only one hidden layer and a linear output (8.25):


[Figure labels: Activation: tanh(W(1) x); Activation: linear]


FIGURE 8.25: A small CNN composed of tanh blocks.


He initialized all the weights and biases (biases not shown for brevity) to zero. What is
the most significant design flaw in his architecture?
Hint: think about back-propagation.


ReLU 


PRB-223 CH.PRB- 8.47.
The rectified linear unit, or ReLU, $g(z) = \max\{0, z\}$, is the default for many CNN
architectures. It is defined by the following function:

\[
f_{ReLU}(x) = \max(0, x) \tag{8.31}
\]

Or:

\[
f_{ReLU}(x) = \begin{cases} 0 & x \le 0 \\ x & x > 0 \end{cases} \tag{8.32}
\]

1. In what sense is the ReLU better than traditional sigmoidal activation functions?


PRB-224 CH.PRB- 8.48.
You are experimenting with the ReLU activation function, and you design a small CNN
(8.26) which accepts an RGB image as an input. Each CNN kernel is denoted by w.


[Figure label: ReLU 3x3]


FIGURE 8.26: A small CNN composed of ReLU blocks.


What is the shape of the resulting tensor W? 


PRB-225 CH.PRB- 8.49.
Name the following activation function where $a \in (0, 1)$:

\[
f(x) = \begin{cases} x & \text{if } x > 0 \\ ax & \text{otherwise} \end{cases} \tag{8.33}
\]
Swish


PRB-226 CH.PRB- 8.50.

In many interviews, you will be given a paper that you have never encountered before,
and be required to read and subsequently discuss it. Please read Searching for Activation
Functions [21] before attempting the questions in this problem.


1. In [21], researchers employed an automatic pipeline for searching what exactly?


2. What types of functions did the researchers include in their search space?


3. What were the main findings of their research and why were the results surprising?


4. Write the formula for the Swish activation function.


5. Plot the Swish activation function.


8.2.6 Performance Metrics 


Comparing different machine learning models, tuning hyper parameters and learning
rates, and finding optimal augmentations are all important steps in ML research.
Typically our goal is to find the best model with the lowest errors on both the training
and validation sets. To do so we need to be able to measure the performance of each
approach/model/parameter setting etc. and compare those measures. For valuable
reference, read: “Evaluating Learning Algorithms: A Classification Perspective” [22].


Confusion matrix, precision, recall 


PRB-227 CH.PRB- 8.51.

You design a binary classifier for detecting the presence of malfunctioning temperature
sensors. Non-malfunctioning (N) devices are the majority class in the training corpus. While
running inference on an unseen test-set, you discover that the confusion matrix (CM) has
the following values (8.27):

         | Predicted P | Predicted N
Actual P | ...         | 4
Actual N | 24          | 1009


FIGURE 8.27: A confusion matrix for functioning (N) temperature sensors. P stands for
malfunctioning devices.


1. Find: TP, TN, FP, FN and correctly label the numbers in table 8.27.
2. What is the accuracy of the model?
3. What is the precision of the model?


4. What is the recall of the model?

ROC-AUC 
The area under the receiver operating characteristic (ROC) curve (Fig. 8.28), known as the 
AUC, is currently considered to be the standard method to assess the accuracy of 
predictive distribution models. 


[Figure: ROC curve. Vertical axis: Sensitivity; horizontal axis: 1 - specificity.] 


FIGURE 8.28: Receiver Operating Characteristic curve. 
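
As a hedged illustration of how the AUC relates to the ROC curve, the following minimal sketch sweeps a decision threshold over a handful of made-up scores and integrates the resulting curve with the trapezoidal rule. The function names `roc_points` and `auc`, as well as the toy labels and scores, are assumptions introduced here for illustration only.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) pairs obtained by sweeping the decision threshold."""
    pos = sum(labels)
    neg = len(labels) - pos
    # Sort samples from the highest score to the lowest.
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        # Samples with equal scores cross the threshold together.
        j = i
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            tp += ranked[j][1]
            fp += 1 - ranked[j][1]
            j += 1
        points.append((fp / neg, tp / pos))
        i = j
    return points


def auc(points):
    """Trapezoidal area under a curve given as (x, y) points sorted by x."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


labels = [1, 1, 0, 1, 0, 0]               # toy ground truth (1 = positive)
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]   # toy classifier scores
curve = roc_points(labels, scores)
print(auc(curve))
```

In this toy case eight of the nine (positive, negative) pairs are ranked correctly, which is exactly the fraction the trapezoidal area measures.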
PRB-228 @ CH.PRB- 8.52. 
Complete the following sentences: 


1. Receiver Operating Characteristics of a classifier shows its performance as a trade off 
between [...] and [...]. 


2. It is a plot of [...] vs. the [...]. In place of [...], one could also use [...] which are essen- 
tially {1 - ‘true negatives’}. 


3. A typical ROC curve has a concave shape with [...] as the beginning and [...] as the 
end point. 


4. The ROC curve of a ‘random guess classifier’, when the classifier is completely 
confused and cannot at all distinguish between the two classes, has an AUC of [...] which 
is the [...] line in an ROC curve plot. 


PRB-229 @ CH.PRB- 8.53. 
The listing in Figure 8.30 and the plot in Figure 8.29 are the output from running 
XGBOOST for a binary classification task. 


[Figure: ROC curve. Title: LOG_LOSS=0.0421598347226; legend: AUC = 0.984440; horizontal axis: False Positive Rate.] 


FIGURE 8.29: ROC AUC 
XGBClassifier(base_score=0.5, colsample_bylevel=1, colsample_bytree=0.5, 
    gamma=0.017, learning_rate=0.15, max_delta_step=0, max_depth=9, 
    min_child_weight=3, missing=None, n_estimators=1000, nthread=-1, 
    objective='binary:logistic', reg_alpha=0, reg_lambda=1, 
    scale_pos_weight=1, seed=0, silent=1, subsample=0.9) 
shape: (316200, 6) 
>ROC AUC:0.984439608912 
> LOG LOSS:0.0421598347226 


FIGURE 8.30: XGBOOST for binary classification. 


How would you describe the results of the classification? 


8.2.7 NN Layers, topologies, blocks 


CNN arithmetics 


PRB-230 @ CH.PRB- 8.54. 
Given an input of size of n x n, filters of size f x f and a stride of s with padding of p, 
what is the output dimension? 


PRB-231 O CH.PRB- 8.55. 
Referring to the code snippet in Fig. (8.31), answer the following questions regarding the 
VGG11 architecture [25]: 
import torchvision 
import torch 

def main(): 
    vgg11 = torchvision.models.vgg11(pretrained=True) 
    vgg_layers = vgg11.features 
    for param in vgg_layers.parameters(): 
        param.requires_grad = False 
    # The three input shapes are illegible in the source. 
    example = [torch.rand(...), 
               torch.rand(...), 
               torch.rand(...)] 
    vgg11.eval() 
    for e in example: 
        out = vgg_layers(e) 
        print(out.shape) 

if __name__ == "__main__": 
    main() 


FIGURE 8.31: CNN arithmetics on the VGG11 CNN model. 
1. In each case for the input variable determine the dimensions of the tensor 
which is the output of applying the VGG11 CNN to the respective input. 
2. Choose the correct option. The last layer of the VGG11 architecture is: 


i Conv2d 
ii MaxPool2d 
iii ReLU 


PRB-232 @ CH.PRB- 8.56. 

Still referring to the code snippet in Fig. (8.31), and specifically to line 7, the code is 
amended so that the line is replaced by the line: 
vgg_layers = vgg11.features[:3] 
1. What type of block is now represented by the new line? Print it using PyTorch. 


2. In each case for the input variable determine the dimensions of the tensor 
which is the output of applying the block 
vgg_layers = vgg11.features[:3] 
to the respective input. 


PRB-233 @ CH.PRB- 8.57. 

Table (8.1) presents an incomplete listing of the VGG11 architecture [25]. As 
depicted, for each layer the number of filters (i.e., neurons with a unique set of 
parameters) is presented. 


Layer #Filters 


conv4_3 512 
fc6 4,096 
fc7 4,096 
output 1,000 


TABLE 8.1: Incomplete listing of the VGG11 architecture. 


Complete the missing parts regarding the dimensions and arithmetics of the VGG11 
CNN architecture: 


1. The VGG11 architecture consists of [...] convolutional layers. 


2. Each convolutional layer is followed by a [...] activation function, and five [...] opera- 
tions thus reducing the preceding feature map size by a factor of [...]. 


3. All convolutional layers have a [...] kernel. 
4. The first convolutional layer produces [...] channels. 


5. Subsequently as the network deepens, the number of channels [...] after each [...] oper- 
ation until it reaches [...]. 
Dropout 


PRB-234 O CH.PRB- 8.58. 
A Dropout layer [26] (Fig. 8.32) is commonly used to regularize a neural network model 
by randomly equating several outputs (the crossed-out hidden node H) to 0. 


FIGURE 8.32: A Dropout layer (simplified form). 


For instance, in PyTorch [20], a Dropout layer is declared as follows (8.2): 


1 import torch 
2 import torch.nn as nn 
3 nn.Dropout(0.2) 
CODE 8.2: Dropout in PyTorch 


Where nn.Dropout(0.2) (Line #3 in 8.2) indicates that the probability of zeroing an 
element is 0.2. 
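
To make the meaning of the argument concrete, here is a minimal sketch, assuming a reasonably recent PyTorch installation, that applies `nn.Dropout(p=0.2)` to a vector of ones and measures how many elements are zeroed. The vector length, the seed, and the variable names are illustrative choices, not part of the original listing.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)           # only to make the sketch repeatable
drop = nn.Dropout(p=0.2)       # each element is zeroed with probability 0.2
x = torch.ones(10000)

drop.train()                   # training mode: dropout is active
y_train = drop(x)
zero_fraction = (y_train == 0).float().mean().item()
print(f"fraction zeroed in training mode: {zero_fraction:.3f}")

drop.eval()                    # evaluation mode: the layer passes its input through
y_eval = drop(x)
print(torch.equal(y_eval, x))
```

The measured fraction should land close to 0.2 in training mode, while in evaluation mode the layer leaves the input unchanged.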
FIGURE 8.33: A Bayesian Neural Network Model 


A new data scientist in your team suggests the following procedure for a Dropout layer 
which is based on Bayesian principles. Each of the neurons θ_n in the neural network in (Fig. 
8.33) may drop (or not) independently of each other exactly like a Bernoulli trial. 

During the training of a neural network, the Dropout layer randomly drops out outputs 
of the previous layer, as indicated in (Fig. 8.32). Here, for illustration purposes, all four 
neurons are dropped as depicted by the crossed-out hidden nodes H_n. 


1. You are interested in the proportion θ of dropped-out neurons. Assume that the chance 
of drop-out, θ, is the same for each neuron (e.g. a uniform prior for θ). Compute the 
posterior of θ. 


2. Describe the similarities of dropout to bagging. 


PRB-235 @ CH.PRB- 8.59. 
A co-worker claims he discovered an equivalence theorem whereby two consecutive 
Dropout layers [26] can be replaced and represented by a single Dropout layer (Fig. 8.34). 


nn.Dropout(p=P) 


nn.Dropout(p=Q) 


FIGURE 8.34: Two consecutive Dropout layers 


He realized two consecutive layers in PyTorch [20], declared as follows (8.3): 

1 import torch 
2 import torch.nn as nn 
3 nn.Sequential( 
4     ..., 
5     nn.ReLU(), 
6     nn.Dropout(p=P, inplace=True), 
7     nn.Dropout(p=Q, inplace=True) 
8 ) 


CODE 8.3: Consecutive dropout in PyTorch 


Where nn.Dropout(0.1) (Line #6 in 8.3) indicates that the probability of zeroing an 
element is 0.1. 


1. What do you think about his idea, is he right or wrong? 


2. Either prove that he is right or provide a single example that refutes his theorem. 


Convolutional Layer 
The convolution layer is probably one of the most important layers in the theory and 
practice of modern deep learning and computer vision in particular. 
To study the optimal number of convolutional layers for the classification of two 
different types of the Ebola virus, a researcher designs a binary classification pipeline 
using a small CNN with only a few layers (8.35): 
[Figure: CNN pipeline. Labels: Input, Raw, Normalize, F_ReLU(x) = max(0, x), Ebola? 
The normalization step is transforms.Normalize(mean=[0.485, 0.456, 0.406], 
std=[0.229, 0.224, 0.225]).] 
FIGURE 8.35: A CNN based classification system. 


Answer the following questions while referring to (8.35): 


PRB-236 O CH.PRB- 8.60. 


If he uses the following filter for the convolutional operation, what would be the resulting 
tensor after the application of the convolutional layer? 




FIGURE 8.36: A small filter for a CNN 


PRB-237 @ CH.PRB- 8.61. 
What would be the resulting tensor after the application of the ReLU layer (8.37)? 


FIGURE 8.37: The result of applying the filter. 


PRB-238 @ CH.PRB- 8.62. 
What would be the resulting tensor after the application of the MaxPool layer (8.78)? 


Pooling Layers 
A pooling layer transforms the output of a convolutional layer, and neurons in a pool- 
ing layer accept the outputs of a number of adjacent feature maps and merge their 
outputs into a single number. 
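
The merging step described above can be sketched for the max-pooling case as follows. This is a minimal NumPy illustration under stated assumptions: the helper name `max_pool_2x2` and the 4 x 4 toy feature map are made up for this example, and the window is fixed to 2 x 2 with a stride of 2 and no padding.

```python
import numpy as np


def max_pool_2x2(feature_map):
    """2 x 2 max pooling with stride 2 and no padding on a 2-D array."""
    h, w = feature_map.shape
    # Drop a trailing row/column when a dimension is odd.
    trimmed = feature_map[: h - h % 2, : w - w % 2]
    # Split into non-overlapping 2 x 2 windows and keep the maximum of each.
    windows = trimmed.reshape(h // 2, 2, w // 2, 2)
    return windows.max(axis=(1, 3))


fmap = np.array([[1, 3, 2, 0],
                 [4, 2, 1, 5],
                 [0, 1, 7, 2],
                 [6, 2, 3, 3]])
pooled = max_pool_2x2(fmap)
print(pooled)
```

Each output entry is the largest value of one non-overlapping 2 x 2 window, so a 4 x 4 map shrinks to 2 x 2.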


MaxPooling 


PRB-239 @ CH.PRB- 8.63. 
The following input (Fig. 8.38) is subjected to a MaxPool2d(2,2) operation having a 2 x 2 
max-pooling filter with a stride of 2 and no padding at all. 
FIGURE 8.38: Input to MaxPool2d operation. 


Answer the following questions: 
1. What is the most common use of max-pooling layers? 


2. What is the result of applying the MaxPool2d operation on the input? 


PRB-240 @ CH.PRB- 8.64. 

While reading a paper about the MaxPool operation, you encounter the following code 
snippet (Code 8.4) of a PyTorch module that the authors implemented. You download their 
pre-trained model, and evaluate its behaviour during inference: 
 1 import torch 
 2 from torch import nn 
 3 class MaxPool001(nn.Module): 
 4     def __init__(self): 
 5         super(MaxPool001, self).__init__() 
 6         self.math = torch.nn.Sequential( 
 7             torch.nn.Conv2d(3, 32, kernel_size=7, padding=2), 
 8             torch.nn.BatchNorm2d(32), 
 9             torch.nn.MaxPool2d(2, 2), 
10             torch.nn.MaxPool2d(2, 2), 
11         ) 
12     def forward(self, x): 
13         print(x.data.shape) 
14         x = self.math(x) 
15         print(x.data.shape) 
16         x = x.view(x.size(0), -1) 
17         print("Final shape: {}".format(x.data.shape)) 
18         return x 
19 model = MaxPool001() 
20 model.eval() 
21 x = torch.rand(1, 3, 224, 224) 
22 out = model.forward(x) 


CODE 8.4: A CNN in PyTorch 


The architecture is presented in Fig. 8.39: 
torch.rand(1, 3, 224, 224) 

nn.Conv2d(3,32) 

| 
nn.BatchNorm2d(32) 

| 
nn.MaxPool2d(2,2) 

| 
nn.MaxPool2d(2,2) 


FIGURE 8.39: Two consecutive MaxPool layers. 


Please run the code and answer the following questions: 
1. In MaxPool2d(2,2), what are the parameters used for? 

2. After running line 8, what is the resulting tensor shape? 

3. Why does line 20 exist at all? 

4. In line 9, there is a MaxPool2d(2,2) operation, followed by yet a second MaxPool2d(2,2). 
What is the resulting tensor shape after running line 9? And line 10? 

5. A friend who saw the PyTorch implementation suggests that lines 9 and 10 may 
be replaced by a single MaxPool2d(4,4) operation while producing the exact same 
results. Do you agree with him? Amend the code and test your assertion. 


Batch normalization, Gaussian PDF 
Recommended readings for this topic are “Batch Normalization: Accelerating Deep Net- 
work Training by Reducing Internal Covariate Shift” [16] and “Delving deep into rectifiers: 
Surpassing human-level performance on imagenet classification” [14]. 

A discussion of batch normalization (BN) would not be complete without a discus- 
sion of the Gaussian normal distribution. Though it would be instructive to develop 
the forward and backwards functions for a BN operation from scratch, it would also 
be quite complex. As an alternative we discuss several aspects of the BN operation 
while expanding on the Gaussian distribution. 
The Gaussian distribution 


PRB-241 O CH.PRB- 8.65. 
1. What is batch normalization? 
2. The normal distribution is defined as follows: 

$$P(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}} \qquad (8.34)$$ 

Generally i.i.d. $X \sim \mathcal{N}(\mu, \sigma^2)$; however, BN uses the standard normal distribution. 
What mean and variance does the standard normal distribution have? 

3. What is the mathematical process of normalization? 

4. Describe how normalization works in BN. 


PRB-242 @ CH.PRB- 8.66. 
In Python, the probability density function for a normal distribution is given by (8.40): 


import scipy.stats 
scipy.stats.norm.pdf(x, mu, sigma) 


FIGURE 8.40: Normal distribution in Python. 


1. Without using Scipy, implement the normal distribution from scratch in Python. 


2. Assume, you want to back propagate on the normal distribution, and therefore you 
need the derivative. Using Scipy write a function for the derivative. 


BN 


PRB-243 @ CH.PRB- 8.67. 
Your friend, a novice data scientist, uses an RGB image (Fig. 8.41) which he then subjects to 
BN as part of training a CNN. 


BatchNorm2D 


FIGURE 8.41: A convolution and BN applied to an RGB image. 


1. Help him understand, during BN, is the normalization applied pixel-wise or per colour 
channel? 


2. In the PyTorch implementation, he made a silly mistake (Fig. 8.42); help him identify it: 

import torch 
from torch import nn 
class BN1001(nn.Module): 
    def __init__(self): 
        super(BN1001, self).__init__() 
        self.cnn = torch.nn.Sequential( 
            torch.nn.Conv2d(3, 64, kernel_size=3, padding=2), 
        ) 
        self.math = torch.nn.Sequential( 
            torch.nn.BatchNorm2d(32), 
            torch.nn.PReLU(), 
            torch.nn.Dropout2d(0.05) 
        ) 
    def forward(self, x): 


FIGURE 8.42: A mistake in a CNN 


Theory of CNN design 


PRB-244 @ CH.PRB- 8.68. 
True or false: An activation function applied after a Dropout, is equivalent to an activ- 
ation function applied before a dropout. 


PRB-245 @ CH.PRB- 8.69. 
Which of the following core building blocks may be used to construct CNNs? Choose all 


the options that apply: 

i Pooling layers 

ii Convolutional layers 
iii Normalization layers 


iv Non-linear activation function 
v Linear activation function 


PRB-246 O CH.PRB- 8.70. 
You are designing a CNN which has a single BN layer. Which of the following core CNN 
designs are valid? Choose all the options that apply: 


i CONV → act → BN → Dropout → ... 
ii CONV → act → Dropout → BN → ... 
iii CONV → BN → act → Dropout → ... 
iv BN → CONV → act → Dropout → ... 
v CONV → Dropout → BN → act → ... 
vi Dropout → CONV → BN → act → ... 


PRB-247 @ CH.PRB- 8.71. 
The following operator is known as the Hadamard product: 


$$\mathrm{OUT} = A \odot B \qquad (8.35)$$ 
Where: 
$$(A \odot B)_{i,j} = (A)_{i,j}\,(B)_{i,j} \qquad (8.36)$$ 
A scientist constructs a Dropout layer using the following algorithm: 
i Assign a probability of p for zeroing the output of any neuron. 
ii Accept an input tensor T, having a shape S. 
iii Generate a new tensor $T' \in \{0,1\}^{S}$. 
iv Assign each element in T' a randomly and independently sampled value from a Bernoulli 
distribution: 

$$T'_i \sim B(1, p) \qquad (8.37)$$ 

v Calculate the OUT tensor as follows: 

$$\mathrm{OUT} = T' \odot T \qquad (8.38)$$ 

You are surprised to find out that his last step is to multiply the output of a dropout layer 
with: 
$$\frac{1}{1-p} \qquad (8.39)$$ 
Explain what is the purpose of multiplying by the term $\frac{1}{1-p}$. 


PRB-248 @ CH.PRB- 8.72. 
Visualized in (8.43) from a high-level view, is an MLP which implements a well-known 


idiom in DL. 
FIGURE 8.43: A CNN block 


1. Name the idiom. 
2. What can this type of layer learn? 


3. A fellow data scientist suggests amending the architecture as follows (8.44) 
FIGURE 8.44: A CNN block 


Name one disadvantage of this new architecture. 


4. Name one CNN architecture where the input equals the output. 


CNN residual blocks 


PRB-249 @ CH.PRB- 8.73. 
Answer the following questions regarding residual networks ([13]). 


1. Mathematically, the residual block may be represented by: 


y =x+ F(x) (8.40) 


What is the function F? 


2. In one sentence, what was the main idea behind deep residual networks (ResNets) as 
introduced in the original paper ([13])? 


PRB-250 @ CH.PRB- 8.74. 
Your friend was thinking about ResNet blocks, and tried to visualize them in (8.45). 


FIGURE 8.45: A resnet CNN block 


1. Assuming a residual of the form y = x + F(x), complete the missing parts in Fig. 
(8.45). 


2. What does the symbol ⊕ denote? 


3. A fellow data scientist, who had coffee with you said that residual blocks may compute 
the identity function. Explain what he meant by that. 


8.2.8 Training, hyperparameters 


Hyperparameter optimization 


PRB-251 @ CH.PRB- 8.75. 


A certain training pipeline for the classification of large images (1024 x 1024) uses the 
following Hyperparameters (8.46): 
Hyperparameter        | Value 
Initial learning rate | 0.1 
Weight decay          | 0.0001 
Momentum              | 0.9 
Batch size            | 1024 


optimizer = optim.SGD(model.parameters(), lr=0.1, 
                      momentum=0.9, 
                      weight_decay=0.0001) 

trainLoader = torch.utils.data.DataLoader( 
    datasets.LARGE('data', train=True, download=True, 
                   transform=transforms.Compose([ 
                       transforms.ToTensor(), 
                   ])), 
    batch_size=1024, shuffle=True) 

FIGURE 8.46: Hyperparameters. 


In your opinion, what could possibly go wrong with this training pipeline? 


PRB-252 @ CH.PRB- 8.76. 
A junior data scientist in your team who is interested in hyperparameter tuning wrote 
the following code (8.5) for splitting his corpus into two distinct sets and fitting an LR model: 
from sklearn import datasets 
from sklearn.linear_model import LogisticRegression 
from sklearn.model_selection import train_test_split 

dataset = datasets.load_iris() 
X_train, X_test, y_train, y_test = \ 
    train_test_split(dataset.data, dataset.target, test_size=0.2) 

clf = LogisticRegression(data_norm=12) 
clf.fit(X_train, y_train) 


CODE 8.5: Train and Validation split. 


He then evaluated the performance of the trained model on the X_test set. 
1. Explain why his methodology is far from perfect. 
2. Help him resolve the problem by utilizing a different splitting methodology. 


3. Your friend now amends the code and uses: 


clf = GridSearchCV(method, params, scoring='roc_auc', cv=5) 
clf.fit(X_train, y_train) 

Explain why his new approach may work better. 


PRB-253 @ CH.PRB- 8.77. 
In the context of Hyperparameter optimization, explain the difference between grid search 
and random search. 


Labelling and bias 


Recommended reading: 
“Added value of double reading in diagnostic radiology, a systematic review” [8]. 


PRB-254 @ CH.PRB- 8.78. 
Non-invasive methods are used to forecast the existence of lung nodules (Fig. 8.47), a 
precursor to lung cancer. Yet, in spite of acquisition standardization attempts, the manual 
detection of lung nodules remains subject to mechanical and inter-observer variability. 
What is more, it is a highly laborious task. 


FIGURE 8.47: Pulmonary nodules. 


In the majority of cases, the training data is manually labelled by radiologists who make 
mistakes. Imagine you are working on a classification problem and hire two radiologists for 
lung cancer screening based on low-dose CT (LDCT). You ask them to label the data, the 
first radiologist labels only the training set and the second the validation set. Then you hire 
a third radiologist to label the test set. 


1. Do you think there is a design flaw in the curation of the data sets? 


2. A friend suggests that all three radiologists read all the scans and label them 
independently, thus creating a majority vote. What do you think about this idea? 


Validation curve ACC 


PRB-255 @ CH.PRB- 8.79. 
Answer the following questions regarding the validation curve visualized in (8.48): 
[Figure: validation curve. Vertical axis: ERR (ticks 0.2 to 0.8); legend: VALID, TRAIN.] 

FIGURE 8.48: A validation curve. 


1. Describe in one sentence, what is a validation curve. 
2. Which hyperparameter is being used in the curve? 


3. Which well-known metric is being used in the curve? Which other metric is commonly 
used? 


4. Which positive phenomenon occurs when we train a NN longer? 
5. Which negative phenomenon occurs when we train a NN longer than we should? 


6. How is this negative phenomenon reflected in Fig. 8.48? 


Validation curve Loss 


PRB-256 O CH.PRB- 8.80. 
Refer to the validation log-loss curve visualized in (8.49) and answer the following ques- 
tions: 
[Figure: loss curve. Vertical axis: LOSS/ACC; horizontal axis: EPOCH (0 to 80).] 
FIGURE 8.49: Log-loss function curve. 


1. Name the phenomenon that starts happening right after the marking by the letter E and 
describe why it is happening. 

2. Name three different weight initialization methods. 

3. What is the main idea behind these methods? 

4. Describe several ways in which this phenomenon can be alleviated. 

5. Your friend, a fellow data scientist, inspects the code and sees that the following 
hyperparameters are being used: 

Hyperparameter | Value 
Initial LR     | 0.00001 
Momentum       | 0.9 
Batch size     | 1024 

He then tells you that the learning rate (LR) is constant and suggests amending the 
training pipeline by adding the following code (8.50): 

scheduler = optim.lr_scheduler.ReduceLROnPlateau(opt) 

FIGURE 8.50: A problem with the log-loss curve. 

What do you think about his idea? 

6. Provide one reason against the use of the log-loss curve. 


Inference 


PRB-257 @ CH.PRB- 8.81. 

You finished training a face recognition algorithm, which uses a feature vector of 128 
elements. During inference, you notice that the performance is not that good. A friend tells 
you that in computer vision faces are gathered in various poses and perspectives. He there- 
fore suggests that during inference you would augment the incoming face five times, run 
inference on each augmented image and then fuse the output probability distributions by 
averaging. 


1. Name the method he is suggesting. 


2. Provide several examples of augmentation that you might use during inference. 


PRB-258 @ CH.PRB- 8.82. 

Complete the sentence: If the training loss is insignificant while the test loss is 
significantly higher, the network has almost certainly learned features which are not 
present in an [...] set. This phenomenon is referred to as [...]. 


8.2.9 Optimization, Loss 


Stochastic gradient descent, SGD 


PRB-259 @ CH.PRB- 8.83. 
What does the term stochastic in SGD actually mean? Does it use any random number 
generator? 


PRB-260 @ CH.PRB- 8.84. 
Explain why in SGD, the number of epochs required to surpass a certain loss threshold 
increases as the batch size decreases? 


Momentum 


PRB-261 @ CH.PRB- 8.85. 
How does momentum work? Explain the role of exponential decay in the gradient descent 
update rule. 


PRB-262 @ CH.PRB- 8.86. 
In your training loop, you are using SGD and a logistic activation function which is 
known to suffer from the phenomenon of saturated units. 


1. Explain the phenomenon. 


2. You switch to using the tanh activation instead of the logistic activation. In your 
opinion, does the phenomenon still exist? 


3. In your opinion, does using the tanh function make SGD converge 
better? 


PRB-263 @ CH.PRB- 8.87. 
Which of the following statements holds true? 


i In stochastic gradient descent we first calculate the gradient and only then adjust weights 
for each data point in the training set. 


ii In stochastic gradient descent, the gradient for a single sample is not so different from 
the actual gradient, so this gives a more stable value, and converges faster. 


iii SGD usually avoids the trap of poor local minima. 
iv SGD usually requires more memory. 


Norms, L1, L2 


PRB-264 @ CH.PRB- 8.88. 
Answer the following questions regarding norms. 


1. Which norm does the following equation represent? 


$$|x_1 - x_2| + |y_1 - y_2| \qquad (8.41)$$ 


2. Which formula does the following equation represent? 
$$\sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \qquad (8.42)$$ 


3. When you read that someone penalized the L2 norm, was the Euclidean or the 
Manhattan distance involved? 


4. Compute both the Euclidean and Manhattan distance of the vectors: 
x1 = [6, 1, 4, 5] and x2 = [2, 8, 3, -1]. 


PRB-265 @ CH.PRB- 8.89. 
You are provided with a Python implementation of the Manhattan distance 
function, based on SciPy (8.51): 


from scipy import spatial 
x1 = [...]  # vector values are illegible in the source 
x2 = [...] 
cityblock = spatial.distance.cityblock(x1, x2) 
print("Manhattan:", cityblock) 


FIGURE 8.51: Manhattan distance function. 
In many cases, and for large vectors in particular, it is better to use a GPU for 
implementing numerical computations. PyTorch has full support for GPUs (and it's my 
favourite DL library ...); use it to implement the Manhattan distance function on a GPU. 


PRB-266 @ CH.PRB- 8.90. 

Your friend is training a logistic regression model for a binary classification problem 
using the L2 loss for optimization. Explain to him why this is a bad choice and which loss he 
should be using instead. 


8.3 Solutions 
8.3.1 Cross Validation 


On the significance of cross validation and stratification in particular, refer to “A study 
of cross-validation and bootstrap for accuracy estimation and model selection” [17]. 


CV approaches 


SOL-177 @ CH.SOL- 8.1. 
The first approach is a leave-one-out CV (LOOCV) and the second is a K-fold 
cross-validation approach. 


SOL-178 y CH.SOL- 8.2. 

Cross validation is a cornerstone of machine learning, allowing data scientists to take 
full advantage of limited training data. In classification, effective cross validation is 
essential to making the learning task efficient and more accurate. A frequently used form 
of the technique is K-fold cross validation. Using this approach, the full data set is 
divided into K randomly selected folds, occasionally stratified, meaning that each fold has 
roughly the same class distribution as the overall data set. Subsequently, for each fold, 
all the other (K - 1) folds are used for training, while the present fold is used for 
testing. This process guarantees that the samples used for testing were never seen by the 
classifier during training. 
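
The fold construction described above can be sketched in a few lines of plain Python. This is only an illustrative sketch: the helper `k_fold_indices`, its `seed` parameter, and the sample count are assumptions made here, and the folds are random but not stratified.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs for K randomly selected folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i receives every k-th shuffled index, so fold sizes differ by at most 1.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


splits = list(k_fold_indices(n_samples=10, k=5))
for train_idx, test_idx in splits:
    # A fresh model instance would be fitted on train_idx and scored on test_idx here.
    print(sorted(test_idx))
```

Every sample lands in exactly one test fold, and the training indices of a split never overlap its test indices.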


K-Fold CV 


SOL-179 UY CH.SOL- 8.3. 
True. We never utilize the test set during a K-fold CV process. 


SOL-180 Ud CH.SOL- 8.4. 
True. This is the average of the individual errors of K estimates of the test error: 


$$MSE_1, \ldots, MSE_K \qquad (8.43)$$ 


SOL-181 & CH.SOL- 8.5. 

The correct answer is: A 5-fold cross-validation approach results in 5 different model 
instances being fitted. It is a common misconception to think that in a K-fold approach the 
same model instance is repeatedly used. We must create a new model instance in each fold. 


SOL-182 y CH.SOL- 8.6. 
The correct answer is: we compute the cross-validation performance as the arithmetic 
mean over the K performance estimates from the validation sets. 


Stratification 


SOL-183 Uy CH.SOL- 8.7. 

The correct answer is: 3-fold CV. A k-fold cross-validation is a special case of 
cross-validation where we iterate over a dataset k times. In each round, we split the dataset 
into k parts: one part is used for validation, and the remaining k - 1 parts are merged into 
a training subset for model evaluation. Stratification is used to balance the classes in the 
training and validation splits in cases where the corpus is imbalanced. 


LOOCV 


SOL-184 UY CH.SOL- 8.8. 


1. True: in LOOCV, K = N, the full sample size. 


2. False: There is no way of a-priori finding an optimal value for K, and the relationship 
between the actual sample size and the resulting accuracy is unknown. 


8.3.2 Convolution and correlation 


The convolution operator 


SOL-185 UY CH.SOL- 8.9. 


1. This is the definition of a convolution operation on the two signals f and g. 


2. In image processing, the term g(t) represents a filtering kernel. 


SOL-186 Uy CH.SOL- 8.10. 


1. True. These operations have two key features: they are shift invariant, and they are 
linear. Shift invariance means that we perform the same operation at every point in the 
image. Linearity means that we replace every pixel with a linear combination of its 
neighbours. 


2. True. See for instance Eq. (8.3). 
3. True. 


The correlation operator 


SOL-187 Y CH.SOL- 8.11. 
1. True. 
2. True. 
SOL-188 Y CH.SOL- 8.12. 
A convolution operation is just like correlation, except that we flip over the filter both 
horizontally and vertically before correlating. 


$$f(x,y) \otimes h(x,y) = \sum_{m=0}^{M-1} \sum_{n=0}^{N-1} f(m,n)\, h(x+m, y+n) \qquad (8.44)$$ 
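
The relationship between the two operators can be checked numerically. The sketch below is a minimal NumPy illustration under stated assumptions: the helper names `correlate2d_valid` and `convolve2d_valid`, the toy image, and the 2 x 2 kernel are invented for this example, and only the 'valid' output region is computed.

```python
import numpy as np


def correlate2d_valid(image, kernel):
    """Slide the kernel over the image without flipping it ('valid' region only)."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for y in range(oh):
        for x in range(ow):
            out[y, x] = np.sum(image[y:y + kh, x:x + kw] * kernel)
    return out


def convolve2d_valid(image, kernel):
    """Convolution = correlation with the kernel flipped along both axes."""
    return correlate2d_valid(image, np.flip(kernel))


image = np.arange(16, dtype=float).reshape(4, 4)
kernel = np.array([[1.0, 0.0],
                   [0.0, -1.0]])
corr = correlate2d_valid(image, kernel)
conv = convolve2d_valid(image, kernel)
print(corr)
print(conv)
```

With this asymmetric kernel the two results differ in sign, whereas a kernel that is unchanged by the flip would make correlation and convolution coincide.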


Padding and stride 
Recommended reading: “A guide to convolution arithmetic for deep learning” by Vincent 
Dumoulin and Francesco Visin (2016) [22]. 


SOL-189 UY CH.SOL- 8.13. 


1. The Valid padding only uses values from the original input; however, when the data 
resolution is not a multiple of the stride, some boundary values are ignored entirely in 
the feature calculation. 


2. The Same padding ensures that every input value is included, but also adds zeros near 
the boundary which are not in the original input. 
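
Both points can be illustrated with the standard output-size arithmetic along a single axis. The sketch below is hedged: the helper names `conv_output_size` and `last_input_index_used` are hypothetical, and the SAME case is shown only for a stride of 1 with an odd filter size.

```python
def conv_output_size(n, f, stride=1, padding=0):
    """Output length along one axis for input n, filter f, given stride and padding."""
    return (n + 2 * padding - f) // stride + 1


def last_input_index_used(n, f, stride=1):
    """Index of the last input value touched by a VALID (no padding) convolution."""
    out = conv_output_size(n, f, stride)
    return (out - 1) * stride + f - 1


# VALID: 8 input values, filter of 3, stride 2. The last value (index 7) is never read.
print(conv_output_size(8, 3, stride=2), last_input_index_used(8, 3, stride=2))

# SAME with stride 1 and an odd filter: pad (f - 1) // 2 zeros on each side.
f = 3
print(conv_output_size(8, f, stride=1, padding=(f - 1) // 2))
```

In the VALID case the filter stops at index 6, so the final input value is ignored; in the SAME case the zero padding keeps the output length equal to the input length.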


SOL-190 UY CH.SOL- 8.14. 
True. Contrast this with the two other types of convolution operations. 


SOL-191 UY CH.SOL- 8.15. 


El ix [E] 1 (8.45) 


SOL-192 CH.SOL- 8.16. 
A is the correct choice. 


SOL-193 UY CH.SOL- 8.17. 
A represents the VALID mode while B represents the SAME mode. 


SOL-194 Uy CH.SOL- 8.18. 


1. The resulting output has a shape of 4 x 4. 


2. Convolution operation 


[ [Su 3 3. a Le] 
3s 2e 03% ez] 
Bi, Bp 3s « Le] 
Se Ore. soe y Lo 
SS = La] 
A oe] 

CI 2a. Ow Za.) 

Zi De Lisa) 
Ze Ose 481 


3. By definition, convolutions in the valid mode, reduce the size of the resulting input 
tensor. 


NNNDNDy 


OOOO 
OOOO 


Kernels and filters 


SOL-195 Uy CH.SOL- 8.19. 


1. Flipping by 180 degrees we get: 


a | oe oe | (8.46) 


2. The Sobel filter, which is frequently used for edge detection in classical computer 
vision. 


SOL-196 & CH.SOL- 8.20. 
The resulting complexity is given by: 


$$K^2 w h \qquad (8.47)$$ 


Convolution and correlation in python 


SOL-197 Y CH.SOL- 8.21. 


1. Convolution operation: 
import numpy as np 

def convolution(A, B): 
    # Lengths of the two 1-D input signals. 
    l_A = np.size(A) 
    l_B = np.size(B) 
    # The full convolution has l_A + l_B - 1 samples. 
    C = np.zeros(l_A + l_B - 1) 
    for m in np.arange(l_A): 
        for n in np.arange(l_B): 
            C[m + n] = C[m + n] + A[m] * B[n] 
    return C 


FIGURE 8.52: Convolution and correlation in python 


2. Correlation operation: 


def crosscorrelation(A, B): 
    return convolution(np.conj(A), B[::-1]) 


FIGURE 8.53: Convolution and correlation in python 


Separable convolutions 


SOL-198 y CH.SOL- 8.22. 


1. No. Since images are usually stored as discrete pixel values, one would have to use a 
discrete approximation of the Gaussian function on the filtering mask before performing 
the convolution. 


2. No. 
3. Yes, it is separable, a property that has great implications. For instance, separability 
means that a 2D convolution can be reduced to two consecutive 1D convolutions, reducing the 
computational runtime from $O(n^2 m^2)$ to $O(n^2 m)$. 
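
Separability can be verified numerically. The following sketch assumes NumPy and uses made-up helper names (`conv2d_full`, `conv2d_separable`), a 5-tap discrete Gaussian with sigma = 1, and a random 8 x 8 image; it compares a direct 2D convolution with two consecutive 1D passes.

```python
import numpy as np


def conv2d_full(image, kernel):
    """Direct 2-D convolution ('full' output) by accumulating shifted copies."""
    h, w = image.shape
    kh, kw = kernel.shape
    out = np.zeros((h + kh - 1, w + kw - 1))
    for i in range(kh):
        for j in range(kw):
            out[i:i + h, j:j + w] += kernel[i, j] * image
    return out


def conv2d_separable(image, col, row):
    """Same result using one 1-D pass along rows, then one along columns."""
    tmp = np.apply_along_axis(lambda r: np.convolve(r, row), 1, image)
    return np.apply_along_axis(lambda c: np.convolve(c, col), 0, tmp)


# Discrete 1-D Gaussian (sigma = 1, 5 taps), normalised to sum to 1.
t = np.arange(-2, 3)
g = np.exp(-t**2 / 2.0)
g /= g.sum()
kernel_2d = np.outer(g, g)          # separable 2-D Gaussian kernel

rng = np.random.default_rng(0)
image = rng.random((8, 8))
direct = conv2d_full(image, kernel_2d)
two_pass = conv2d_separable(image, g, g)
print(np.allclose(direct, two_pass))
```

The direct version performs one multiply-accumulate per kernel entry per pixel, while the two-pass version needs only one per kernel row entry plus one per kernel column entry.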


8.3.3 Similarity measures 


Image, text similarity 


SOL-199 Y CH.SOL- 8.23. 
The algorithm presented in (8.12) normalizes the input vector. This is usually done prior 
to applying any other method to the vector or before persisting a vector to a database of FVs. 


SOL-200 UY CH.SOL- 8.24. 


1. The algorithm presented in (8.1) is one of the most commonly used image similarity 
measures and is entitled cosine similarity. It can be applied to any pair of images. 

2. The mathematical formula behind it is as follows. The cosine similarity between two vectors
$u = \{u_1, u_2, \ldots, u_n\}$ and $v = \{v_1, v_2, \ldots, v_n\}$ is defined as:

$\mathrm{sim}(u, v) = \frac{u \cdot v}{\|u\| \, \|v\|} = \frac{\sum_{i=1}^{n} u_i v_i}{\sqrt{\sum_{i=1}^{n} u_i^2} \, \sqrt{\sum_{i=1}^{n} v_i^2}}$

Thus, the cosine similarity between two vectors measures the cosine of the angle
between the vectors irrespective of their magnitude. It is calculated as the dot product
of two numeric vectors, and is normalized by the product of the length of the vectors.


3. For feature vectors with non-negative components, the minimum and maximum values it can
return are 0 and 1 respectively (for arbitrary real vectors the range is [-1, 1]). Thus, a
cosine similarity value close to 1 indicates a very high similarity, while a value
close to 0 indicates a very low similarity.


4. It represents the negative distance in Euclidean space between the vectors.
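As a small illustration of the definition, here is a sketch in plain Python. The function name `cosine_similarity` and the sample vectors are assumptions chosen for the example, not part of the original algorithm listing.

```python
import math

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Parallel vectors: the magnitude does not matter, only the direction.
parallel = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])

# Orthogonal vectors share no direction at all.
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])

# Opposite vectors show that negative components can push the value below zero.
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])
```

The last case only arises when components may be negative, which is typically not the case for histogram-like feature vectors.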
Jaccard similarity


SOL-201 @ CH.SOL- 8.25. 
1. The general formula for the Jaccard similarity of two sets is given as follows:

$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$


2. That is, the ratio of the size of the intersection of A and B to the size of their union. 


3. The Jaccard similarity equals: 


4. Given (8.13) 


For the three combinations of pairs above, we have

$J(\{11, 16, 17\}, \{12, 14, 16, 18\}) = \frac{1}{6}$
$J(\{11, 12, 13, 14, 15\}, \{11, 16, 17\}) = \frac{1}{7}$
$J(\{11, 12, 13, 14, 15\}, \{12, 14, 16, 18\}) = \frac{2}{7}$
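The three pairs can be evaluated directly from the definition (intersection size over union size). The sketch below uses Python sets and exact fractions; the helper name `jaccard` is an assumption made for the example.

```python
from fractions import Fraction

def jaccard(a, b):
    """Size of the intersection divided by the size of the union."""
    a, b = set(a), set(b)
    return Fraction(len(a & b), len(a | b))

s1 = {11, 16, 17}
s2 = {12, 14, 16, 18}
s3 = {11, 12, 13, 14, 15}

j12 = jaccard(s1, s2)  # the sets share only 16
j31 = jaccard(s3, s1)  # the sets share only 11
j32 = jaccard(s3, s2)  # the sets share 12 and 14
```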


The Kullback-Leibler Distance 


SOL-202 Uy CH.SOL- 8.26. 
Each KLD corresponds to the definition of: 


i Jensen [1] 
ii Bennet [2]
iii Bigi [3] 
iv Ziv [29] 


MinHash 


Read the paper entitled Detecting near-duplicates for web crawling [12] and answer the 
following questions. 


SOL-203 Ud CH.SOL- 8.27. 
A hashing function (8.54) maps a value into a constant-length string that can be compared
with other hashed values.

[Figure: values mapped to the hash keys HashKey-0, HashKey-1, HashKey-2, ..., HashKey-K]

FIGURE 8.54: The idea of hashing


The idea behind hashing is that items are hashed into buckets, such that similar items 
will have a higher probability of hashing into the same buckets. 

The goal of MinHash is to compute the Jaccard similarity without actually computing the
intersection and union of the sets, which would be slower. The main idea behind MinHash
is to devise a signature scheme such that the probability that there is a match between the
signatures of two sets, $S_1$ and $S_2$, is equal to the Jaccard measure [12].
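The signature idea can be sketched as follows. This is an illustrative toy, not the scheme of the cited paper: the hash family (SHA-256 salted with a seed), the number of hash functions and the two example sets are all assumptions. The explicit intersection and union are computed only to compare against the estimate.

```python
import hashlib

def _hash(seed, item):
    """One member of a salted hash family, returning a 64-bit integer."""
    digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def minhash_signature(items, num_hashes=256):
    """For each hash function keep the minimum hash value over the set."""
    return [min(_hash(seed, x) for x in items) for seed in range(num_hashes)]

def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions on which the two sets agree."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)

A = set(range(0, 100))
B = set(range(50, 150))

true_j = len(A & B) / len(A | B)  # 50 shared elements out of 150
est_j = estimate_jaccard(minhash_signature(A), minhash_signature(B))
same_j = estimate_jaccard(minhash_signature(A), minhash_signature(A))
```

The estimate becomes tighter as the number of hash functions grows, at the cost of longer signatures.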
SOL-204 CH.SOL- 8.28.

Locality-Sensitive Hashing (LSH) is a method which is used for determining which items 
in a given set are similar. Rather than using the naive approach of comparing all pairs of items 
within a set, items are hashed into buckets, such that similar items will be more likely to hash 
into the same buckets. 


SOL-205 Y CH.SOL- 8.29. 
Maximise. 


8.3.4 Perceptrons 


The Single Layer Perceptron 


SOL-206 y CH.SOL- 8.30. 
Answer: one, one, feedback. 


SOL-207 Y CH.SOL- 8.31. 


1. True. 
2. True. 


$C(w, b) = \frac{1}{2n} \sum_{x} \| y(x) - a(x, w, b) \|^2$ (8.48)


where w denotes the collection of all weights in the network, b all the biases, n is the
total number of training inputs and a(x, w, b) is the vector of outputs from the network
which has weights w, biases b and the input x.


$\arg\min C(w, b)$. (8.49)


5. Gradient descent. 




6. The gradient. 


7. Stochastic gradient descent. Batch size. Back-propagation. 


The Multi Layer Perceptron 


SOL-208 @ CH.SOL- 8.32. 
1. This operation is a dot product with the given weights. Therefore:

$out = x_1 w_1 + x_2 w_2 = 0.9 \times (-0.3) + 0.7 \times 0.15 = -0.165$ (8.50)


2. This operation (sum) is a dot product with the given weights and with the given bias
added. Therefore:

$out_1 = x_1 w_1 + x_2 w_2 + b_1 = 0.9 \times (-0.3) + 0.7 \times 0.15 + 0.001 = -0.164$ (8.51)


3. Code snippet 8.55 provides a pure PyTorch-based implementation of the MLP operation. 


import torch
# .type(torch.FloatTensor)
x = torch.tensor([0.9, 0.7])
w = torch.tensor([-0.3, 0.15])
B = torch.tensor([0.001])

print(torch.sum(x*w))
print(torch.sum(x*w) + B)


FIGURE 8.55: MLP operations. 
Activation functions in perceptrons


SOL-209 & CH.SOL- 8.33. 
1. Since by definition:

$f_{ReLU}(x) = \begin{cases} x & \text{if } x > 0 \\ 0 & \text{otherwise} \end{cases}$ (8.52)

and the output of the linear sum operation was -0.164, the output is out2 = 0.


2. Code snippet 8.56 provides a pure PyTorch-based implementation of the MLP operation. 


import torch

x = torch.tensor([0.9, 0.7])
w = torch.tensor([-0.3, 0.15])
B = torch.tensor([0.001])

print(torch.sum(x*w))
print(torch.sum(x*w) + B)
# Add the bias once, after the weighted sum, and then apply the ReLU.
print(torch.relu(torch.sum(x*w) + B))


FIGURE 8.56: MLP operations. 


Back-propagation in perceptrons 


SOL-210 Y CH.SOL- 8.34. The answers are as follows: 


1. Non-differentiable at 0. 
2. Non-differentiable at 0. 
3. Even though for $x \neq 0$:

$f'(x) = \sin\frac{1}{x} - \frac{1}{x} \cos\frac{1}{x}$, (8.53)

the function is still non-differentiable at 0.


4. Non-differentiable at 0. 


SOL-211 Y CH.SOL- 8.35. 


1. Fig 8.57 uses a loop (inefficient but easy to understand) to print the values:

for i in range(0, w.size(0)):
    print(torch.relu(torch.sum(x*w[i]) + B))
> tensor([0.])
> tensor([0.])
> tensor([0.6630])


FIGURE 8.57: MLP operations- values. 


2. The values at each hidden layer are depicted in 8.58 
Output


0.0 
0.0 
0.6630 


FIGURE 8.58: Hidden layer values, simple MLP. 


3. Fig 8.59 uses a loop (inefficient but easy to understand) to print the values:

x1 = torch.tensor([0.0, 0.0, 0.6630]).type(torch.FloatTensor)  # Hidden layer outputs
w1 = torch.tensor([
    [0.15, -0.46, 0.59],
    [0.10, 0.32, -0.79],
]).type(torch.FloatTensor)  # Weights
for i in range(0, w1.size(0)):
    print(torch.sum(x1*w1[i]))
> tensor(0.3912)
> tensor(-0.5238)


FIGURE 8.59: MLP operations- values at the output. 


4. We can apply the Softmax function as in 8.60:

x1 = torch.tensor([0.0, 0.0, 0.6630]).type(torch.FloatTensor)
w1 = torch.tensor([
    [0.15, -0.46, 0.59],
    [0.10, 0.32, -0.79],
]).type(torch.FloatTensor)  # Weights
out1 = torch.tensor([[torch.sum(x1*w1[0])],
                     [torch.sum(x1*w1[1])]])
print(out1)
yhat = torch.softmax(out1, dim=0)
print(yhat)
> tensor([[ 0.3912],
          [-0.5238]])
> tensor([[0.7140],
          [0.2860]])


FIGURE 8.60: MLP operations- Softmax. 


5. For the cross-entropy loss, we use the Softmax values and calculate the result as follows: 


$-1.0 \times \log(0.7140) - 0.0 \times \log(0.2860) \approx 0.337$, using the natural logarithm. (8.54)


The theory of perceptrons 


SOL-212 UY CH.SOL- 8.36. 

He means that theoretically [6], a non-linear layer followed by a linear layer can
approximate any non-linear function with arbitrary accuracy, provided that there are enough
non-linear neurons.


SOL-213 CH.SOL- 8.37. True.
SOL-214 CH.SOL- 8.38. True.
SOL-215 CH.SOL- 8.39.
False. Divided by the number of training samples, not the number of incorrectly classified samples.


Learning logical gates 


SOL-216 UY CH.SOL- 8.40. 


1. The values are presented in the following table (8.61): 


Bias = —2.5 
Input | Weighted sum | Output 
(0,0) -2.5 0 
(0,1) -1.5 0 
(1,0) -1.5 0 
(1,1) -0.5 0 
FIGURE 8.61: Logical AND: B=-2.5 


2. The values are presented in the following table (8.62): 


Bias = —0.25 
Input | Weighted sum | Output 
(0,0) -0.25 0 
(0,1) -0.75 0 
(1,0) -0.75 0 
(1,1) 1.75 1 
FIGURE 8.62: Logical AND: B=-0.25 


3. The perceptron learning rule is an algorithm that can automatically compute optimal 


weights for the perceptron. 
4. The main addition by [22] and [18] was the introduction of a differentiable activation
function.


5. If we select $w_1 = 1$, $w_2 = 1$ and threshold = 1, we get:

$x_1 = 1, x_2 = 1$: $n = 1 \times 1 + 1 \times 1 = 2$, thus $y = 1$
$x_1 = 1, x_2 = -1$: $n = 1 \times 1 + 1 \times (-1) = 0$, thus $y = -1$
$x_1 = -1, x_2 = 1$: $n = 1 \times (-1) + 1 \times 1 = 0$, thus $y = -1$
$x_1 = -1, x_2 = -1$: $n = 1 \times (-1) + 1 \times (-1) = -2$, thus $y = -1$ (8.55)


Or summarized in a table (8.63):


AND gate
in1 | in2 | out
0 | 0 | 0
0 | 1 | 0
1 | 0 | 0
1 | 1 | 1


FIGURE 8.63: Logical AND gate 
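The bipolar AND perceptron with both weights equal to 1 and a threshold of 1 can be sketched in a few lines. The firing rule used here (output 1 when the weighted sum strictly exceeds the threshold, otherwise -1) is an assumption consistent with the four cases worked out above; the function name `perceptron` is hypothetical.

```python
def perceptron(x1, x2, w1=1, w2=1, threshold=1):
    """Bipolar threshold unit: +1 if the weighted sum exceeds the threshold, else -1."""
    n = w1 * x1 + w2 * x2
    return 1 if n > threshold else -1

# Evaluate all four bipolar input combinations.
outputs = {(x1, x2): perceptron(x1, x2) for x1 in (1, -1) for x2 in (1, -1)}
```

Only the input (1, 1) produces a weighted sum (2) above the threshold, which is exactly the AND behaviour.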


8.3.5 Activation functions (rectification) 


We concentrate only on the most commonly used activation functions, those which
the reader is most likely to encounter or use during their daily work.


Sigmoid 
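SOL-217 CH.SOL- 8.41.


1. Recall the analytical derivative of the sigmoid:

$\frac{d}{dx}\sigma(x) = \frac{d}{dx}\left[\frac{1}{1 + e^{-x}}\right]$ (8.56)
$\frac{d}{dx}\sigma(x) = \frac{d}{dx}\left(1 + e^{-x}\right)^{-1}$ (8.57)
$\frac{d}{dx}\sigma(x) = -1 \left(1 + e^{-x}\right)^{-2} \frac{d}{dx}\left(1 + e^{-x}\right)$ (8.58)
$\frac{d}{dx}\sigma(x) = -1 \left(1 + e^{-x}\right)^{-2} \left(0 + e^{-x} \frac{d}{dx}(-x)\right)$ (8.59)
$\frac{d}{dx}\sigma(x) = -1 \left(1 + e^{-x}\right)^{-2} e^{-x} (-1)$ (8.60)
$\frac{d}{dx}\sigma(x) = \frac{e^{-x}}{\left(1 + e^{-x}\right)^{2}}$ (8.61)
$\frac{d}{dx}\sigma(x) = \frac{1}{1 + e^{-x}} \cdot \frac{e^{-x}}{1 + e^{-x}}$ (8.62)
$\frac{d}{dx}\sigma(x) = \sigma(x) \left(1 - \sigma(x)\right)$ (8.63)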


Code snippet 8.64 provides a pure C++ based implementation of the backward pass that 
directly computes the analytical gradients in C++. 


#include <torch/script.h>
#include <vector>

torch::Tensor sigmoid001_d(torch::Tensor & x) {
    torch::Tensor s = sigmoid001(x);
    return (1 - s) * s;
}


FIGURE 8.64: Backward pass for the Sigmoid function using Libtorch. 
2. Code snippet 8.65 depicts one way of printing the values.
#include <torch/script.h>
#include <iostream>
#include <vector>
int main() {
    std::vector<float> v{0.0, 0.1, 0.2, 0.3,
        0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.99};
    for (auto it = v.begin(); it != v.end(); ++it) {
        torch::Tensor t0 = torch::tensor((*it));
        std::cout << (*it) << "," <<
            sigmoid001(t0).data().detach().item().toFloat() << ","
            << sigmoid001_d(t0).data().detach().item().toFloat()
            << '\n';
    }
}


FIGURE 8.65: Evaluation of the sigmoid and its derivative in C++ using Libtorch. 


3. The manual derivative of eq. 8.27 is:

$\frac{3 \ln(2) \times 2^{1.5x}}{2 \left(2^{1.5x} + 1\right)^{2}}$ (8.64)


4. The forward pass for the Sigmoid function approximation eq. 8.27 is presented in code
snippet 8.66:
#include <torch/script.h>
#include <vector>
torch::Tensor sig_approx(const torch::Tensor & x) {
    torch::Tensor sig = 1.0 / (1.0 + torch::pow(2, (-1.5 * x)));
    return sig;
}

FIGURE 8.66: Forward pass for the Sigmoid function approximation in C++ using Libtorch. 


5. Code snippet 8.67 prints the values:

#include <torch/script.h>
#include <iostream>
#include <vector>
int main() {
    std::vector<float> v{0.0, 0.1, 0.2, 0.3,
        0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.99};
    for (auto it = v.begin(); it != v.end(); ++it) {
        torch::Tensor t0 = torch::tensor((*it));
        std::cout << (*it) << "," <<
            sigmoid001(t0).data().detach().item().toFloat()
            << "," << sig_approx(t0).data().detach().item().toFloat()
            << '\n';
    }
}


FIGURE 8.67: Printing the values for Sigmoid and Sigmoid function approximation in C++ 
using Libtorch. 


And the values are presented in Table 8.2:

Value | Sig | Approx
0 | 0.5 | 0.5
0.1 | 0.524979 | 0.52597
0.2 | 0.549834 | 0.5518
0.3 | 0.574443 | 0.577353
0.4 | 0.598688 | 0.602499
0.5 | 0.622459 | 0.627115
0.6 | 0.645656 | 0.65109
0.7 | 0.668188 | 0.674323
0.8 | 0.689974 | 0.69673
0.9 | 0.710949 | 0.71824
0.99 | 0.729088 | 0.736785


TABLE 8.2: Computed values for the Sigmoid and the Sigmoid approximation. 
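For readers without a Libtorch setup, the same comparison can be sketched in plain Python. The function names are assumptions; `sig_approx` mirrors the base-2 approximation 1 / (1 + 2^(-1.5x)) used in the C++ forward pass, and `sig_approx_grad` is the derivative obtained by differentiating that expression directly.

```python
import math

def sigmoid(x):
    """Exact logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    """Analytical derivative: sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

def sig_approx(x):
    """Base-2 approximation of the sigmoid."""
    return 1.0 / (1.0 + 2.0 ** (-1.5 * x))

def sig_approx_grad(x):
    """Derivative of the approximation with respect to x."""
    p = 2.0 ** (1.5 * x)
    return 3.0 * math.log(2.0) * p / (2.0 * (p + 1.0) ** 2)

values = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.99]
rows = [(v, sigmoid(v), sig_approx(v)) for v in values]
for v, s, a in rows:
    print(f"{v},{s:.6f},{a:.6f}")
```

The approximation is exact at zero and drifts slowly upward from the true sigmoid as the input grows.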


Tanh 


SOL-218 UY CH.SOL- 8.42. 
The answers are as follows: 


1. The derivative is: 


$f'_{tanh}(x) = 1 - f_{tanh}(x)^{2}$ (8.65)


2. Code snippet 8.68 implements the forward pass using pure Python. 


import numpy as np
import torch

xT = torch.abs(torch.tensor([[0.37, 0.192, 0.571]], requires_grad=True)) \
    .type(torch.DoubleTensor)
xT_np = xT.detach().cpu().numpy()
print("Input: \n", xT_np)

tanh_values = np.tanh(xT_np)
print("Numpy:", tanh_values)
> Numpy: [[0.3540 0.1897 0.5161]]

FIGURE 8.68: Forward pass for tanh using pure Python.


3. In order to implement a PyTorch-based torch.autograd.Function such as
tanh, we must provide both the forward and backward pass implementations. The
mechanism behind this idiom in PyTorch is the use of a context, abbreviated ctx,
which acts like a state manager for automatic differentiation. The implementation is
depicted in 8.69:


import torch

class TanhFunction(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        # Save the input so that backward can compute the local derivative.
        ctx.save_for_backward(x)
        y = x.tanh()
        return y

    @staticmethod
    def backward(ctx, grad_output):
        input, = ctx.saved_tensors
        # d/dx tanh(x) = 1 / cosh(x)^2
        dy_dx = 1 / (input.cosh() ** 2)
        out = grad_output * dy_dx
        print("backward: {}".format(out))
        return out


FIGURE 8.69: Tanh in PyTorch. 


4. Code snippet 8.70 verifies the correctness of the implementation using gradcheck.
import torch
import numpy as np
from torch.autograd import gradcheck

xT = torch.abs(torch.tensor([[0.37, 0.192, 0.571]],
    requires_grad=True)) \
    .type(torch.DoubleTensor)
xT_np = xT.detach().cpu().numpy()
tanh_values = np.tanh(xT_np)
f = TanhFunction.apply  # The custom Function from 8.69
tanh_values_torch = f(xT)
print("Torch:", tanh_values_torch)
test = gradcheck(lambda t: f(t), xT)
print(test)
> Torch: tensor([[0.3540, 0.1897, 0.5161]], dtype=torch.float64, grad_fn=<TanhFunctionBackward>)
> backward: tensor([[0.8747, 0.9640, 0.7336]], dtype=torch.float64)


FIGURE 8.70: Invoking gradcheck on tanh. 


SOL-219 UY CH.SOL- 8.43. 


1. The type of NN is a MultiLayer Perceptron or MLP. 


2. There are two hidden layers. 


SOL-220 UY CH.SOL- 8.44. 
He is partially correct; see, for example, Understanding the difficulty of training deep
feedforward neural networks [9].
SOL-221 CH.SOL- 8.45.


Initialize all parameters to a constant zero value. When we apply the tanh function to an
input which is very large, the gradient, which is almost zero, will be propagated to the remaining
partial derivatives, leading to the well-known phenomenon.


SOL-222 UY CH.SOL- 8.46. 


During the back-propagation process, derivatives are calculated with respect to $W^{(1)}$
and also $W^{(2)}$. The design flaw:


i Your friend initialized all weights and biases to zero.
ii Therefore any gradient with respect to $W^{(2)}$ would also be zero.
iii Subsequently, $W^{(2)}$ will never be updated.
iv This would inadvertently cause the derivative with respect to $W^{(1)}$ to be always zero.
v Finally, $W^{(1)}$ would also never be updated.


ReLU 


SOL-223 Y CH.SOL- 8.47. 


The ReLU function has the benefit of not saturating for positive inputs since its derivative 
is one for any positive value. 


SOL-224 Y CH.SOL- 8.48. 
The shape is: 


3x3x3x16 


SOL-225 UY CH.SOL- 8.49. 


The activation function is a leaky ReLU, which on some occasions may outperform the
ReLU activation function.


Swish 


SOL-226 Uy CH.SOL- 8.50. 
1. They intended to find new, better-performing activation functions.


2. They had a list of basic mathematical functions to choose from, for instance the
exponential families exp(), sin(), min and max.


3. Previous research found several activation function properties which were considered
very useful, for instance gradient preservation and monotonicity. However, the
surprising discovery was that the swish function violates both of these previously deemed
useful properties.


4. The equation is:


$f(x) = x \cdot \sigma(x)$ (8.66)


5. The plot is 8.71


FIGURE 8.71: A plot of the Swish activation function. 
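The shape of the curve can be probed with a short plain-Python sketch of Swish, f(x) = x * sigmoid(x). The sample points are chosen here only to expose the dip below zero for negative inputs; the function names are assumptions.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def swish(x):
    """Swish activation: the input scaled by its own sigmoid."""
    return x * sigmoid(x)

# Sample the function on a few negative and positive inputs.
samples = {x: swish(x) for x in (-5.0, -1.0, 0.0, 1.0)}
```

The value at -1 is lower than the values at both -5 and 0, so the function decreases and then increases again, which is what makes it non-monotonic.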




8.3.6 Performance Metrics 


Confusion matrix, precision, recall 


SOL-227 Y CH.SOL- 8.51. 
1. The values are labelled inside 8.72:


          Predicted
          P        N
Truth P   TP=12    FN=7
Truth N   FP=24    TN=1009


FIGURE 8.72: TP, TN, FP, FN.


2. $acc = \frac{12 + 1009}{12 + 7 + 24 + 1009} = 0.97$ (8.67)
3. $prec = \frac{12}{12 + 24} = 0.333$ (8.68)
4. $recall = \frac{12}{12 + 7} \approx 0.632$ (8.69)
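The three metrics follow directly from the four confusion-matrix counts. A minimal sketch, with variable names chosen for the example:

```python
# Counts taken from the confusion matrix of this solution.
TP, FN, FP, TN = 12, 7, 24, 1009

accuracy = (TP + TN) / (TP + TN + FP + FN)  # share of all predictions that are correct
precision = TP / (TP + FP)                  # share of predicted positives that are real
recall = TP / (TP + FN)                     # share of real positives that were found

print(round(accuracy, 2), round(precision, 3), round(recall, 3))
```

Note how the high accuracy is driven almost entirely by the large number of true negatives, while precision and recall expose the weak performance on the positive class.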
ROC-AUC 


The area under the receiver operating characteristic (ROC) curve, 8.73 known as the 
AUC, is currently considered to be the standard method to assess the accuracy of 
predictive distribution models. 
[Plot: sensitivity (vertical axis) versus 1 - specificity (horizontal axis)]


FIGURE 8.73: Receiver Operating Characteristic curve. 


SOL-228 @ CH.SOL- 8.52. 

The ROC curve allows one to assess the relationship between the sensitivity and specificity of a
binary classifier. Sensitivity, or true positive rate, measures the proportion of positives correctly
classified; specificity, or true negative rate, measures the proportion of negatives correctly
classified. Conventionally, the true positive rate (tpr) is plotted against the false positive rate
(fpr), which is one minus the true negative rate.


1. The Receiver Operating Characteristic of a classifier shows its performance as a trade-off
between specificity and sensitivity.


2. It is a plot of the 'true positive rate' vs. the 'true negative rate'. In place of the
'true negative rate', one could also use the 'false positive rate', which is 1 - 'true negative rate'.


3. A typical ROC curve has a concave shape with (0,0) as the beginning and (1,1) as the
end point.


4. The ROC curve of a 'random guess classifier', when the classifier is completely confused
and cannot at all distinguish between the two classes, has an AUC of 0.5 and follows the
'x = y' line in an ROC curve plot.
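How the curve and its area arise from classifier scores can be sketched as follows. This is an illustrative implementation under simplifying assumptions (binary labels 0/1, distinct scores, trapezoidal integration); the function names and the toy data are not from the text.

```python
def roc_points(labels, scores):
    """Return (fpr, tpr) points obtained by lowering the decision threshold."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]
    tp = fp = 0
    # Visit samples from the highest score to the lowest one.
    for score, label in sorted(zip(scores, labels), key=lambda t: -t[0]):
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Area under the piecewise-linear curve (trapezoidal rule)."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# A perfect ranking: every positive scores higher than every negative.
perfect = auc(roc_points([1, 1, 0, 0], [0.9, 0.8, 0.3, 0.1]))

# One negative is ranked above one positive.
imperfect = auc(roc_points([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1]))
```

In the second case three of the four positive-negative pairs are ordered correctly, which matches the area of 0.75.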
SOL-229 CH.SOL- 8.53.


The ROC curve of an ideal classifier (100% accuracy) has an AUC of 1, with 0.0 'false
positives' and 1.0 'true positives'. The ROC curve in our case is almost ideal, which may
indicate over-fitting of the XGBOOST classifier to the training corpus.


8.3.7 NN Layers, topologies, blocks 


CNN arithmetics 


SOL-230 Ul CH.SOL- 8.54. 
Output dimension: L x L x M where L = "2? + 1 
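For reference, the usual relation between the input size, kernel size, padding and stride of a convolution or pooling layer can be sketched as below. The parameter names are assumptions and may differ from the symbols used in the question; the formula is the standard floor((n + 2p - k) / s) + 1.

```python
def conv_output_size(n, kernel, padding=0, stride=1):
    """Spatial output size of a convolution or pooling layer along one axis."""
    return (n + 2 * padding - kernel) // stride + 1

# A 3x3 convolution with padding 1 and stride 1 keeps the spatial size.
same = conv_output_size(224, kernel=3, padding=1, stride=1)

# A 2x2 max-pooling with stride 2 halves it.
halved = conv_output_size(same, kernel=2, padding=0, stride=2)
```

Applied per spatial axis, this reproduces shapes such as 224 x 224 becoming 112 x 112 after a padded 3x3 convolution followed by 2x2 pooling.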


SOL-231 y CH.SOL- 8.55. 
The answers are as follows: 


1. Output dimensions: 


i torch.Size([1, 512, 7, 7]) 
ii torch.Size([1, 512, 16, 16]) 
iii torch.Size([1, 512, 22, 40]) 


2. The layer is MaxPool2d. 


SOL-232 y CH.SOL- 8.56. 
The answers are as follows: 


1. A convolutional block 8.74. 
Sequential(
  (0): Conv2d(3, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (1): ReLU(inplace=True)
  (2): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
)


FIGURE 8.74: Convolutional block from the VGG11 architecture. 


2. The shapes are as follows: 


i torch.Size([1, 64, 112, 112]) 
ii torch.Size([1, 64, 256, 256]) 
iii torch.Size([1, 64, 352, 512]) 


SOL-233 YY CH.SOL- 8.57. 

The VGG11 architecture contains eight convolutional layers, each followed by a ReLU
activation function, and five max-pooling operations, each reducing the respective feature
map by a factor of 2. All convolutional layers have a 3 x 3 kernel. The first convolutional
layer produces 64 channels and subsequently, as the network deepens, the number of channels
doubles after each max-pooling operation until it reaches 512.


Dropout 


SOL-234 UY CH.SOL- 8.58. 
1. The observed data, i.e. the dropped neurons, are distributed according to:


$(x_1, \ldots, x_n) \mid \theta \overset{iid}{\sim} \mathrm{Bern}(\theta)$ (8.70)
Denoting s and f as the number of successes and failures respectively, we know that the likelihood is:
$p(x_1, \ldots, x_n \mid \theta) = \theta^{s} (1 - \theta)^{f}$ (8.71)
With the parameters $\alpha = \beta = 1$, the beta distribution acts like a uniform prior:
$\theta \sim \mathrm{Beta}(\alpha, \beta)$, given $\alpha = \beta = 1$ (8.72)


Hence, the prior density is:


$p(\theta) = \frac{1}{B(\alpha, \beta)} \theta^{\alpha - 1} (1 - \theta)^{\beta - 1}$ (8.73)
Therefore the posterior is:


$p(\theta \mid x_1, \ldots, x_n) \propto p(x_1, \ldots, x_n \mid \theta) \, p(\theta)$
$\propto \theta^{s} (1 - \theta)^{f} \, \theta^{\alpha - 1} (1 - \theta)^{\beta - 1}$ (8.74)
$= \theta^{\alpha + s - 1} (1 - \theta)^{\beta + f - 1}$


2. In dropout, in every training epoch, neurons are randomly pruned with probability
P = p sampled from a Bernoulli distribution. During inference, all the neurons are used
but their output is multiplied by the a priori probability P. This approach resembles, to
some degree, the model averaging approach of bagging.


SOL-235 GJ CH.SOL- 8.59. 
The answers are as follows: 


1. The idea is true and a solid one. 


2. The idiom may be exemplified as follows 8.75: 
nn.Dropout(p=P): (1-P) neurons kept


nn.Dropout(p=Q): (1-Q) neurons kept


Single Dropout (new p = 1 - (1-P) * (1-Q))


FIGURE 8.75: Equivalence of two consecutive dropout layers 


The keep probabilities combine by multiplication at each layer, resulting in a single dropout
layer with probability:
$1 - (1 - p)(1 - q)$ (8.75)
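The combined drop probability can be sanity-checked with a small Monte Carlo simulation. This sketch models only the masking step (the rescaling applied by real dropout layers is ignored), and the probabilities 0.3 and 0.5 are arbitrary values chosen for the example.

```python
import random

def combined_drop_probability(p, q):
    """Drop probability of a single layer equivalent to two stacked dropouts."""
    return 1.0 - (1.0 - p) * (1.0 - q)

def simulate(p, q, n=200_000, seed=0):
    """Empirical fraction of units dropped by at least one of the two masks."""
    rng = random.Random(seed)
    dropped = 0
    for _ in range(n):
        kept_first = rng.random() >= p   # survives the first dropout
        kept_second = rng.random() >= q  # survives the second dropout
        if not (kept_first and kept_second):
            dropped += 1
    return dropped / n

analytic = combined_drop_probability(0.3, 0.5)
empirical = simulate(0.3, 0.5)
```

A unit survives only if it survives both masks, so the keep probabilities multiply and the empirical fraction lands close to the analytic value.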


Convolutional Layer 


SOL-236 Y CH.SOL- 8.60. 
The result is (8.76):


Input: [7, 3, -6, 2, 5], filter: [3, 1]
7*3 + 3*1 = 24
3*3 + (-6)*1 = 3
(-6)*3 + 2*1 = -16
2*3 + 5*1 = 11


FIGURE 8.76: The result of applying the filter. 
SOL-237 CH.SOL- 8.61.
The result is (8.77):


$f_{ReLU}(x) = \max(0, x)$


7*3 + 3*1 = 24 -> 24
3*3 + (-6)*1 = 3 -> 3
(-6)*3 + 2*1 = -16 -> 0
2*3 + 5*1 = 11 -> 11

FIGURE 8.77: The result of applying a ReLU activation. 


SOL-238 Ø CH.SOL- 8.62. 
The result is (8.78):


[24, 3, 0, 11] -> MaxPool -> [24, 11]

FIGURE 8.78: The result of applying a MaxPool layer. 
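The three steps (filter, ReLU, max-pooling) can be traced in plain Python. This is a sketch under assumptions: the input column [7, 3, -6, 2, 5] and the filter [3, 1] are how the figures appear to read, and the pooling window of size 2 with stride 2 is assumed for illustration.

```python
# Assumed values read from the figures: a 1D input column and a 2-tap filter.
x = [7, 3, -6, 2, 5]
w = [3, 1]

# Slide the filter over the input (no padding, stride 1).
conv = [sum(x[i + k] * w[k] for k in range(len(w)))
        for i in range(len(x) - len(w) + 1)]

# ReLU clamps negative responses to zero.
relu = [max(0, v) for v in conv]

# Max-pooling with an assumed window of 2 and stride 2.
pooled = [max(relu[i:i + 2]) for i in range(0, len(relu), 2)]

print(conv, relu, pooled)
```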


Pooling Layers 
MaxPooling 




SOL-239 Ud CH.SOL- 8.63. 
The answers are as follows: 


1. A max-pooling layer is most commonly used after a convolutional layer in order to 
reduce the spatial size of CNN feature maps. 


2. The result is 8.79: 


FIGURE 8.79: Output of the MaxPool2d operation. 


SOL-240 UY CH.SOL- 8.64. 


1. In MaxPool2D(2,2), the first parameter is the size of the pooling operation and the 
second is the stride of the pooling operation. 


2. The BatchNorm2D operation does not change the shape of the tensor from the previous
layer and therefore it is:
torch.Size([1, 32, 222, 222]).


3. During the training of a CNN we use model.train() so that Dropout layers are active.
However, in order to run inference, we would like to turn this mechanism off,
and this is accomplished by model.eval(), which instructs PyTorch
not to apply dropout.


4. The resulting tensor shape is:
torch.Size([1, 32, 55, 55])
If we reshape the tensor like in line 17 using:
x = x.view(x.size(0), -1)
Then the tensor shape becomes:
torch.Size([1, 96800])


5. Yes, you should agree with him, as depicted by the following plot 8.80: 


nn.MaxPool2d(2,2) -> nn.MaxPool2d(2,2) = nn.MaxPool2d(4,4)


FIGURE 8.80: A single MaxPool layer.
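The equivalence can be verified numerically without PyTorch. The sketch below implements non-overlapping max-pooling with NumPy reshapes; the helper name `max_pool2d` and the random 8 x 8 input are assumptions made for the example.

```python
import numpy as np

def max_pool2d(x, k):
    """Non-overlapping k x k max-pooling (stride k) on a 2D array."""
    h, w = x.shape
    # Drop any rows or columns that do not fill a complete window.
    trimmed = x[:h - h % k, :w - w % k]
    # Split each axis into (blocks, k) and take the maximum inside every block.
    return trimmed.reshape(h // k, k, w // k, k).max(axis=(1, 3))

rng = np.random.default_rng(0)
x = rng.random((8, 8))

twice = max_pool2d(max_pool2d(x, 2), 2)  # two consecutive 2x2 poolings
once = max_pool2d(x, 4)                  # one 4x4 pooling
```

The maximum over a 4 x 4 window equals the maximum of the maxima of its four 2 x 2 sub-windows, which is why the two results coincide.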


Batch normalization, Gaussian PDF 


The Gaussian distribution 


SOL-241 YY CH.SOL- 8.65. 
The answers are as follows: 


1. BN is a method that normalizes the mean and variance of each of the elements during 
training. 


2. $X \sim N(0, 1)$: a mean of zero and a variance of one. The standard normal distribution
occurs when $\sigma^2 = 1$ and $\mu = 0$.


3. In order to normalize we: 


i Step one is to subtract the mean to shift the distribution. 
ii Divide all the shifted values by their standard deviation (the square root of the 
variance). 


4. In BN, the normalization is applied on an element by element basis. During training at
each epoch, every element in the batch has to be shifted and scaled so that it has a zero
mean and unit variance within the batch.
SOL-242 CH.SOL- 8.66.


1. One possible realization is as follows 8.81: 


from math import sqrt
import math
def normDist(x, mu, sigSqrt):
    # sigSqrt is the variance (sigma squared).
    return (1 / sqrt(2 * math.pi * sigSqrt)) * math.e ** ((-0.5) *
        (x - mu) ** 2 / sigSqrt)


FIGURE 8.81: Normal distribution in Python: from scratch. 


2. The derivative is given by 8.82: 


scipy.stats.norm.pdf(x, mu, sigma) * (mu - x) / sigma**2


FIGURE 8.82: The derivative of a Normal distribution in Python. 
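The closed-form derivative can be cross-checked against a finite-difference estimate. This sketch uses only the standard library; the function names, the evaluation point and the parameters mu = 1 and sigma = 2 are assumptions chosen for illustration.

```python
import math

def norm_pdf(x, mu, sigma):
    """Density of the normal distribution with mean mu and standard deviation sigma."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def norm_pdf_derivative(x, mu, sigma):
    """Derivative of the density with respect to x."""
    return norm_pdf(x, mu, sigma) * (mu - x) / sigma ** 2

x, mu, sigma, h = 0.3, 1.0, 2.0, 1e-6

analytic = norm_pdf_derivative(x, mu, sigma)
# Central finite difference as an independent check.
numeric = (norm_pdf(x + h, mu, sigma) - norm_pdf(x - h, mu, sigma)) / (2 * h)
```

The derivative vanishes at the mean, is positive to its left and negative to its right, matching the bell shape of the density.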


SOL-243 Y CH.SOL- 8.67. 


1. During training of a CNN, when a convolution is being followed by a BN layer, for 
each of the three RGB channels a single separate mean and variance is being computed. 


2. The mistake he made is using a BN with a batch size of 32, while the output from the 
convolutional layer is 64. 
Theory of CNN design


SOL-244 Y CH.SOL- 8.68. 
True. 


SOL-245 UY CH.SOL- 8.69. 
All the options may be used to build a CNN.


SOL-246 Y CH.SOL- 8.70. While the original paper ([16]) suggests that BN layers be 
used before an activation function, it is also possible to use BN after the activation function. 
In some cases, it actually leads to better results ([4]). 


SOL-247 Y CH.SOL- 8.71. 

When dropout is enabled during the training process, in order to keep the expected output
at the same value, the output of a dropout layer must be multiplied by this term. Of course,
during inference no dropout is taking place at all.


SOL-248 UY CH.SOL- 8.72. 


1. The idiom is a bottleneck layer ([27]), which may act much like an autoencoder. 


2. Reducing and then increasing the activations, may force the MLP to learn a more com- 
pressed representation. 


3. The new architecture has far more connections and therefore it would be prone to over- 
fitting. 


4. One such architecture is an autoencoder ([28]).


CNN residual blocks 


SOL-249 Uy CH.SOL- 8.73. 
1. The function F is the residual function.


2. The main idea was to add an identity connection which skips two layers altogether.


SOL-250 UY CH.SOL- 8.74. 


1. The missing parts are visualized in (8.83). 


X 


F(x) 


y=x+ F(x) 
FIGURE 8.83: A resnet CNN block 
2. The symbol represents the addition operator. 


3. Whenever F returns a zero, then the input X will reach the output without being 
modified. Therefore, the term identity function. 


8.3.8 Training, hyperparameters 


Hyperparameter optimization 


SOL-251 y CH.SOL- 8.75. 
The question states that the image size is quite large and the batch size is 1024; therefore
training may fail to allocate memory on the GPU with an Out Of Memory (OOM) error message. This
is one of the most commonly faced errors when junior data scientists start training models.


SOL-252 GJ CH.SOL- 8.76. 


1. Since he is tuning his hyperparameters on the validation set, he would most probably
overfit to the validation set, which he also used for evaluating the performance of the
model.


2. One way to amend the splitting is to first keep a fraction of the training set
aside, for instance 0.1, and then split the remaining 0.9 into a training and a validation
set, for instance 0.8 and 0.1.


3. His new approach uses GridSearchCV with 5-fold cross-validation to tune his Hyper- 
parameters. Since he is using cross validation with five folds, his local CV metrics would 
better reflect the performance on an unseen data set. 


SOL-253 y CH.SOL- 8.77. 

In grid search, a set of pre-determined values is selected by a user for each dimension in 
his search space, and then thoroughly attempting each and every combination. Naturally, with 
such a large search space the number of the required combinations that need to be evaluated 
scale exponentially in the number of dimensions in the grid search. 

In random search the main difference is that the algorithm samples completely random 
points for each of the dimensions in the search space. Random search is usually faster and may 
even produce better results. 


Labelling and bias 


Recommended reading: 
“Added value of double reading in diagnostic radiology, a systematic review” [8].


SOL-254 UY CH.SOL- 8.78. 
There is a potential for bias in certain settings such as this. If the whole training set
is labelled by only a single radiologist, it is possible that their professional history would
inadvertently introduce bias into the corpus. Even if we use the form of radiology report reading
known as double reading, it would not necessarily be true that the annotated scans would be
devoid of bias or that the quality would be better [8].


Validation curve ACC 


SOL-255 y CH.SOL- 8.79. 
The answers are as follows: 


1. A validation curve displays on a single graph a chosen hyperparameter on the hori- 
zontal axis and a chosen metric on the vertical axis. 


2. The hyperparameter is the number of epochs 


3. The quality metric is the error (1 - accuracy). Accuracy, error = (1 - accuracy) or loss are
typical quality metrics.


4. The longer the network is trained, the better it gets on the training set.


5. At some point the network is fit too well to the training data and loses its capability to 
generalize. While the classifier is still improving on the training set, it gets worse on 
the validation and the test set. 


6. At this point the quality curve of the training set and the validation set diverge. 


Validation curve Loss 


SOL-256 Uy CH.SOL- 8.80. 
The answers are as follows: 


1. What we are witnessing is a phenomenon known as a plateau. This may happen when the
optimization protocol cannot improve the loss for several epochs.


2. Three possible methods are:


i Constant


ii Xavier/Glorot uniform
iii Xavier/Glorot normal


3. Good initialization would optimally generate activations that produce initial gradients 
that are larger than zero. One idea is that the training process would converge faster if 
unit variance is achieved ([16]). Moreover, weights should be selected carefully so that: 


i They are large enough thus preventing gradients from decaying to zero. 


ii They are not too large causing activation functions to over saturate. 
4. There are several ways to reduce the problem of plateaus:


i Add some type of regularization. 
ii In cases wherein the plateau happens right at the beginning, amend the way weights 
are initialized. 


iii Amending the optimization algorithm altogether, for instance using SGD instead 
of Adam and vice versa. 


5. Since the initial LR is already very low, his suggestion may worsen the situation since 
the optimiser would not be able to jump off and escape the plateau. 


6. In contrast to accuracy, Log loss has no upper bounds and therefore at times may be 
more difficult to understand and to explain. 


Inference 


SOL-257 Y CH.SOL- 8.81. 


1. Usually, data augmentation is a technique that is heavily used during training,
especially for increasing the number of instances of minority classes. In this case,
augmentations are used during inference, and this method is called Test Time Augmentation
(TTA).


2. Here are several image augmentation methods for TTA, with two augmentations shown
also in PyTorch.
Horizontal flip
Vertical flip
Rotation
Scaling
Crops


transforms.RandomHorizontalFlip(p=1)(image)
transforms.RandomVerticalFlip(p=1)(image)


FIGURE 8.84: Several image augmentation methods for TTA. 


SOL-258 UY CH.SOL- 8.82. 


i Unseen 


ii Overfitting


8.3.9 Optimization, Loss 


Stochastic gradient descent, SGD 


SOL-259 Ud CH.SOL- 8.83. 
There is no relation to random number generation, the true meaning is the use of batches 
during the training process. 


SOL-260 CH.SOL- 8.84.
A larger batch size decreases the variance of the gradient estimation of SGD. Therefore, if
your training loop uses larger batches, the model will converge faster. On the other hand,
smaller batch sizes increase the variance, leading to the opposite phenomenon: longer
convergence times.
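
The variance claim can be checked with a small simulation. This is a sketch under an assumed noise model (per-sample gradients equal to a true gradient of 1.0 plus unit Gaussian noise); it only illustrates the variance of the mini-batch estimate, not convergence speed.

```python
import random
import statistics

random.seed(0)

# Hypothetical per-sample gradients: true gradient 1.0 plus unit Gaussian noise.
per_sample_grads = [1.0 + random.gauss(0.0, 1.0) for _ in range(20000)]

def minibatch_gradient_variance(grads, batch_size, n_batches=2000):
    """Empirical variance of the mini-batch mean gradient."""
    estimates = []
    for _ in range(n_batches):
        batch = random.sample(grads, batch_size)
        estimates.append(sum(batch) / batch_size)
    return statistics.variance(estimates)

var_small = minibatch_gradient_variance(per_sample_grads, batch_size=4)
var_large = minibatch_gradient_variance(per_sample_grads, batch_size=64)
print(var_small, var_large)
```

Under this noise model the variance of the estimate scales roughly as 1/batch_size, so the larger batch gives a much less noisy gradient.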


Momentum 


SOL-261 CH.SOL- 8.85.

Momentum introduces an extra term into the gradient descent update rule: a moving average
of the historical gradients, whose contributions decay exponentially. Using such a term
has been demonstrated to accelerate the training process ([11]), requiring fewer epochs to
converge.
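
As a rough illustration, the sketch below compares plain gradient descent with the classic heavy-ball form of momentum (v = beta * v + g) on a toy ill-conditioned quadratic. The objective, learning rate and beta value are assumptions chosen for the example, not values from the text.

```python
def grad(point):
    """Gradient of the toy objective f(x, y) = 0.5 * (x**2 + 25 * y**2)."""
    x, y = point
    return (x, 25.0 * y)

def steps_to_converge(lr, beta, tol=1e-3, max_steps=10000):
    """Run gradient descent with momentum coefficient beta (beta=0 is plain GD)
    and return the number of updates until the iterate norm drops below tol."""
    point = (1.0, 1.0)
    velocity = (0.0, 0.0)
    for step in range(1, max_steps + 1):
        g = grad(point)
        # Exponentially decaying accumulation of past gradients.
        velocity = tuple(beta * v + gi for v, gi in zip(velocity, g))
        point = tuple(p - lr * v for p, v in zip(point, velocity))
        if (point[0] ** 2 + point[1] ** 2) ** 0.5 < tol:
            return step
    return max_steps

steps_plain = steps_to_converge(lr=0.03, beta=0.0)
steps_momentum = steps_to_converge(lr=0.03, beta=0.9)
print(steps_plain, steps_momentum)
```

On this particular problem the momentum run needs noticeably fewer updates, because the accumulated velocity speeds up progress along the shallow direction.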


SOL-262 CH.SOL- 8.86.


The answers are as follows:


1. The derivative of the logistic activation function is extremely small for either negative or
positive large inputs.


2. The use of the tanh function does not alleviate the problem since we can scale and
translate the sigmoid function to represent the tanh function:


\tanh(z) = 2\sigma(2z) - 1 \qquad (8.76)


While the sigmoid function is centred around 0.5, the tanh activation is centred around
zero. Similar to the application of BN, centring the activations may help the optimizer
converge faster. Note: there is no relation to SGD; the issue exists when using other optimization
functions as well.
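
Both claims are easy to check numerically. The following sketch uses only the standard library: it verifies the identity between tanh and the scaled, translated sigmoid at a few sample points, and evaluates the derivatives at a large input to show how small they become.

```python
import math

def sigmoid(z):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_derivative(z):
    """Derivative of the logistic function: s(z) * (1 - s(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

def tanh_derivative(z):
    """Derivative of tanh: 1 - tanh(z)**2."""
    return 1.0 - math.tanh(z) ** 2

# The identity tanh(z) = 2 * sigmoid(2z) - 1 holds at every sampled point.
for z in (-3.0, -0.5, 0.0, 0.5, 3.0):
    assert abs(math.tanh(z) - (2.0 * sigmoid(2.0 * z) - 1.0)) < 1e-12

# Both derivatives vanish for large inputs, so tanh saturates as well.
print(sigmoid_derivative(0.0), sigmoid_derivative(10.0), tanh_derivative(10.0))
```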


SOL-263 CH.SOL- 8.87.
The answers are as follows: 


i True. 


ii False. In stochastic gradient descent, the gradient for a single sample is quite different
from the actual gradient, so this gives a noisier value and converges more slowly.
iii True. 


iv False. SGD requires less memory. 


Norms, L1, L2 


SOL-264 CH.SOL- 8.88.
1. The L2 norm.


2. The Euclidean distance, which is calculated as the square root of the sum of the squared
differences between the corresponding coordinates of two points.


3. The Manhattan distance is an L1 norm (introduced by Hermann Minkowski) while the 
Euclidean distance is an L2 norm. 


4. The Manhattan distance is:

|6 - 2| + |1 - 8| + |4 - 3| + |5 - (-1)| = 4 + 7 + 1 + 6 = 18 \qquad (8.77)


5. The Euclidean distance is:


\sqrt{(6-2)^2 + (1-8)^2 + (4-3)^2 + (5-(-1))^2} = \sqrt{102} \approx 10.0995 \qquad (8.78)
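
The two hand calculations can be double-checked with a few lines of plain Python. This is just a sanity check on the same two points, independent of PyTorch.

```python
import math

x1 = [6, 1, 4, 5]
x2 = [2, 8, 3, -1]

# L1 (Manhattan) distance: sum of absolute coordinate differences.
manhattan = sum(abs(a - b) for a, b in zip(x1, x2))

# L2 (Euclidean) distance: square root of the sum of squared differences.
euclidean = math.sqrt(sum((a - b) ** 2 for a, b in zip(x1, x2)))

print(manhattan, round(euclidean, 4))
```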


SOL-265 CH.SOL- 8.89.

The PyTorch implementation is in (8.85). Note that we are allocating tensors on a GPU,
but first they are created on a CPU using numpy. This is typical of the interplay between
the CPU and the GPU when training NN models. Note that this only works if you have a GPU
available; in case no GPU is detected, the code falls back to the CPU.

%reset -f
import torch
import numpy

use_cuda = torch.cuda.is_available()
device = torch.device("cuda" if use_cuda else "cpu")
print(device)
x1np = numpy.array([6, 1, 4, 5])
x2np = numpy.array([2, 8, 3, -1])
x1t = torch.FloatTensor(x1np).to(device)  # Move to GPU if available
x2t = torch.FloatTensor(x2np).to(device)
# Square root of the summed squared differences
dist = torch.sqrt(torch.pow(x1t - x2t, 2).sum())
dist
> cuda
tensor(10.0995, device='cuda:0')


FIGURE 8.85: Manhattan distance function in PyTorch. 


SOL-266 CH.SOL- 8.90.

The L2 loss is suitable for a target, or a response variable that is continuous. On the other
hand, in a binary classification problem using LR we would like the output to match either
zero or one and a natural candidate for a loss function is the binary cross-entropy loss.
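
To make the contrast concrete, here is a small sketch comparing the squared error with the binary cross-entropy for a positive example (label 1) at a few hypothetical predicted probabilities. The helper names are illustrative, and the predicted probability is assumed to lie strictly between 0 and 1.

```python
import math

def squared_error(y_true, y_prob):
    """L2 loss for a single prediction."""
    return (y_true - y_prob) ** 2

def binary_cross_entropy(y_true, y_prob):
    """Binary cross-entropy for a single prediction, with 0 < y_prob < 1."""
    return -(y_true * math.log(y_prob) + (1 - y_true) * math.log(1 - y_prob))

# A confident wrong prediction (p = 0.01 for a positive example) is penalised
# far more heavily by the cross-entropy than by the squared error.
for p in (0.9, 0.5, 0.01):
    print(p, squared_error(1, p), binary_cross_entropy(1, p))
```

The squared error of a probability can never exceed 1, whereas the cross-entropy grows without bound as the predicted probability of the true class approaches zero.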
CHAPTER 9


JOB INTERVIEW MOCK EXAM


A man who dares to waste one hour of time has not discovered the value of life. 


— Charles Darwin 
Stressful events, such as a job interview, prompt concern and anxiety (as they do for
virtually every person), but it is the lack of preparation that fuels unnecessary nervousness.
Many perceive the interview as a potentially threatening event. Testing your
knowledge in AI using a mock exam is an effective way not only to identify your
weaknesses and pinpoint the concepts and topics that need brushing up, but
also to become more relaxed in similar situations. Remember that at the heart of job
interview confidence is feeling relaxed.

Taking this test early enough gives you a head start before the actual interview, so
that you can target areas that need perfecting. The exam includes questions from
a wide variety of topics in AI, so that these areas are recognised; it would then
be a case of solving all the problems in this book over a period of a few months to be
properly prepared. Do not worry even if you cannot solve any of the problems in the
exam, as some of them are quite difficult.


DEEP LEARNING JOB INTERVIEW MOCK EXAM 
EXAM INSTRUCTIONS: 


YOU SHOULD NOT SEARCH FOR SOLUTIONS ON THE WEB. MORE GENERALLY, YOU 
ARE URGED TO TRY AND SOLVE THE PROBLEMS WITHOUT CONSULTING ANY REFER- 
ENCE MATERIAL, AS WOULD BE THE CASE IN A REAL JOB INTERVIEW. 


9.0.1 Rules 
REMARK: In order to receive credits, you must: 


i Show all work neatly. 
ii A sheet of formulas and calculators are permitted but not notes or texts. 
iii Read the problems CAREFULLY 
iv Do not get STUCK at any problem (or in local minima ...) for too much time! 
v After completing all problems, a double check is STRONGLY advised. 


vi You have three hours to complete all questions. 




9.1 Problems 
9.1.1 Perceptrons 


PRB-267 CH.PRB- 9.1. [PERCEPTRONS]


The following questions refer to the MLP depicted in (9.1). The inputs to the MLP in
(9.1) are x1 = 0.9 and x2 = 0.7 respectively, and the weights are w1 = -0.3 and w2 = 0.15
respectively. There is a single hidden node, H1. The bias term, B1, equals 0.001.


Inputs Sum 


FIGURE 9.1: Several nodes in a MLP. 


1. We examine the mechanism of a single hidden node, H1. The inputs and weights go
through a linear transformation. What is the value of the output (out1) observed at 
the sum node? 


2. What is the resulting value from the application of the sum operator? 


3. Using PyTorch tensors, verify the correctness of your answers. 


9.1.2 CNN layers 


PRB-268 CH.PRB- 9.2. [CNN LAYERS]
While reading a paper about the MaxPool operation, you encounter the following code
snippet (9.1) of a PyTorch module that the authors implemented. You download their
pre-trained model, and examine its behaviour during inference:

 1 import torch
 2 from torch import nn
 3 class MaxPool001(nn.Module):
 4     def __init__(self):
 5         super(MaxPool001, self).__init__()
 6         self.math = torch.nn.Sequential(
 7             torch.nn.Conv2d(3, 32, kernel_size=7, padding=2),
 8             torch.nn.BatchNorm2d(32),
 9             torch.nn.MaxPool2d(2, 2),
10             torch.nn.MaxPool2d(2, 2),
11         )
12     def forward(self, x):
13         print(x.data.shape)
14         x = self.math(x)
15         print(x.data.shape)
16         x = x.view(x.size(0), -1)
17         print("Final shape:{}", x.data.shape)
18         return x
19 model = MaxPool001()
20 model.eval()
21 x = torch.rand(1, 3, 224, 224)
22 out = model.forward(x)


CODE 9.1: A CNN in PyTorch 


The architecture is presented in (9.2):
torch.rand(1, 3, 224, 224)

nn.Conv2d(3,32) 

| 
nn.BatchNorm2d(32) 

| 
nn.MaxPool2d(2,2) 

| 
nn.MaxPool2d(2,2) 


FIGURE 9.2: Two consecutive MaxPool layers. 


Please run the code and answer the following questions: 
1. In MaxPool2D(2,2), what are the parameters used for?


2. After running line 8, what is the resulting tensor shape?


3. Why does line 20 exist at all?


4. In line 9, there is a MaxPool2D(2,2) operation, followed by yet
a second MaxPool2D(2,2). What is the resulting tensor shape after running line 9?
and line 10?


5. A friend who saw the PyTorch implementation suggests that lines 9 and 10 may
be replaced by a single MaxPool2D(4,4) operation while producing the exact same
results. Do you agree with him? Amend the code and test your assertion.


9.1.3 Classification, Logistic regression 


PRB-269 CH.PRB- 9.3. [CLASSIFICATION, LR]


To study factors that affect the survivability of humans infected with COVID19 using
logistic regression, a researcher considers the link between lung cancer and COVID19 as a
plausible risk factor. The predictor variable is a count of removed pulmonary nodules (Fig.
9.3) in the lungs.


FIGURE 9.3: Pulmonary nodules. 


The response variable Y measures whether the patient shows any remission (as in the 
manifestations of a disease, e. g. yes=1, no=0) when the pulmonary nodules count shifts up 
or down. The output from training a logistic regression classifier is as follows: 


Parameter            DF    Estimate    Standard Error
Intercept            1     -4.8792     1.0732
Pulmonary nodules    1     0.0258      0.0194


1. Estimate the probability of improvement when the count of removed pulmonary nod- 
ules of a patient is 33. 


2. Find out the removed pulmonary nodules count at which the estimated probability of 
improvement is 0.5. 


3. Find out the estimated odds ratio of improvement for an increase of 1, in the total 
removed pulmonary nodule count. 


4. Obtain a 99% confidence interval for the true odds ratio of improvement for an increase of
1 in the total removed pulmonary nodule count. Remember that the most common
confidence levels are 90%, 95%, 99%, and 99.9%.

Confidence Level z
90% 1.645 
95% 1.960 
99% 2.576 
99.9% 3.291 


TABLE 9.1: Common confidence levels 


Table 9.1 lists the z values for these levels. 
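
For reference, the z values in the table are the two-sided critical values of the standard normal distribution. A short sketch using Python's standard library shows where they come from (the helper name `z_value` is illustrative):

```python
from statistics import NormalDist

def z_value(confidence_level):
    """Two-sided critical value of the standard normal distribution."""
    alpha = 1.0 - confidence_level
    return NormalDist().inv_cdf(1.0 - alpha / 2.0)

for level in (0.90, 0.95, 0.99, 0.999):
    print(f"{level:.1%}", round(z_value(level), 3))
```

A confidence interval at a given level is then formed as the estimate plus or minus the corresponding z value times the standard error.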


9.1.4 Information theory 


PRB-270 CH.PRB- 9.4. [INFORMATION THEORY]


This question discusses the link between binary classification, information gain and
decision trees. Recent research suggests that the co-existence of influenza (Fig. 9.4) and
the COVID19 virus may decrease the survivability of humans infected with the COVID19
virus. The data (Table 9.2) comprises a training set of feature vectors with corresponding
class labels which a researcher intends to classify using a decision tree.

To study factors affecting COVID19 eradication, the deep-learning researcher collects
data regarding two independent binary variables: θ1 (T/F), indicating whether the patient is
a female, and θ2 (T/F), indicating whether the human tested positive for the influenza virus.
The binary response variable, y, indicates whether eradication was observed (e.g. eradication=+,
no eradication=-).


FIGURE 9.4: The influenza virus.


Referring to Table (9.2), each row indicates the observed values, columns (θi) denote
features and rows (<θi, yi>) denote labelled instances while class label (y) denotes whether
eradication was observed.


y | θ1 | θ2
+ | T  | T
- | T  | F
+ | T  | F
+ | T  | T
- | F  | T


TABLE 9.2: Decision trees and the COVID19 virus. 


1. Describe what is meant by information gain.


2. Describe in your own words how a decision tree works.


3. Using logs, and the provided dataset, calculate the sample entropy H(y).


4. What is the information gain IG(X1) = H(y) - H(y|θ1) for the provided training
corpus?


PRB-271 CH.PRB- 9.5.
What is the entropy of a biased coin? Suppose a coin is biased such that the probability
of ‘heads’ is p(x) = 0.98.


1. Complete the sentence: We can predict ‘heads’ for each flip with an accuracy of [___]%.


2. Complete the sentence: If the result of the coin toss is ‘heads’, the amount of Shannon 
information gained is [___] bits. 


3. Complete the sentence: If the result of the coin toss is ‘tails’, the amount of Shannon 
information gained is [___] bits. 


4. Complete the sentence: It is always true that the more information is associated with
an outcome, the [more/less] surprising it is. 


5. Provided that the ratio of tosses resulting in ‘heads’ is p(xh), and the ratio of tosses
resulting in ‘tails’ is p(xt), and also provided that p(xh) + p(xt) = 1, what is the
formula for the average surprise?


6. What is the value of the average surprise in bits? 


PRB-272 CH.PRB- 9.6.

Complete the sentence: The relative entropy D(p||q) is the measure of (a) [___] between 
two distributions. It can also be expressed as a measure of the (b)[___] of assuming that the 
distribution is q when the (c)[___] distribution is p. 


9.1.5 Feature extraction 


PRB-273 CH.PRB- 9.7. [FEATURE EXTRACTION]


A data scientist extracts a feature vector from an image using a pre-trained ResNet34
CNN (9.5).

import torchvision.models as models


res_model = models.resnet34(pretrained=True)


FIGURE 9.5: PyTorch declaration for a pre-trained ResNet34 CNN (simplified). 


He then applies the following algorithm, entitled xxx on the image (9.2). 


CODE 9.2: An unknown algorithm in C++11 


#include <cmath>
#include <vector>

void xxx(std::vector<float>& arr) {
    float mod = 0.0;
    for (float i : arr) {
        mod += i * i;
    }
    float mag = std::sqrt(mod);
    for (float& i : arr) {
        i /= mag;
    }
}


Which results in this vector (9.6): 


0.7766 | 0.4455 | 0.8342 | 0.6324 | ... | k = 512


Values after applying xxx to a k-element FV.


FIGURE 9.6: A one-dimensional 512-element embedding for a single image from the ResNet34
architecture.


Name the algorithm that he used and explain in detail why he used it.


PRB-274 CH.PRB- 9.8. [FEATURE EXTRACTION]

The following question discusses the method of fixed feature extraction from layers of the
VGG19 architecture for the classification of the COVID19 pathogen. It depicts FE principles
which are applicable with minor modifications to other CNNs as well. Therefore, if you
happen to encounter a similar question in a job interview, you are likely to be able to cope
with it by utilizing the same logic.

In (Fig. 9.7), 2 different classes of human cells are displayed; infected and not-infected, 
which were curated from a dataset of 4K images labelled by a majority vote of two expert 
virologists. Your task is to use FE to correctly classify the images in the dataset. 


FIGURE 9.7: A dataset of human cells infected by the COVID19 pathogen. 


Table (9.3) presents an incomplete listing of the VGG19 architecture. As depicted,
for each layer the number of filters (i.e. neurons with a unique set of parameters), learnable
parameters (e.g. weights and biases), and FV size are presented.

Layer name    #Filters    #Parameters    #Features
conv4_3       512         2.3M           512
fc6           4,096       103M           4,096
fc7           4,096       17M            4,096
output        1,000       4M             -
Total         13,416      138M           12,416


TABLE 9.3: Incomplete listing of the VGG19 architecture


1. Describe how the VGG19 CNN may be used as fixed FE for a classification task. In 
your answer be as detailed as possible regarding the stages of FE and the method used 
for classification. 


2. Referring to Table (9.3), suggest three different ways in which features can be extrac- 
ted from a trained VGG19 CNN model. In each case, state the extracted feature layer 
name and the size of the resulting FE. 


3. After successfully extracting the features for the 4k images from the dataset, how can 
you now classify the images into their respective categories? 


9.1.6 Bayesian deep learning 


PRB-275 CH.PRB- 9.9. [BAYESIAN DEEP LEARNING]


A recently published paper presents a new layer for Bayesian neural networks (BNNs).
The layer behaves as follows. During the feed-forward operation, each of the hidden neurons
Hn, n ∈ {1,2, } in the neural network in (Fig. 9.8) may, or may not fire, independently
of each other, according to a known prior distribution.


FIGURE 9.8: Likelihood in a BNN model. 


The chance of firing, γ, is the same for each hidden neuron. Using the formal definition,
calculate the likelihood function of each of the following cases:


1. The hidden neuron is distributed according to X ~ B(n, γ) random variable and fires
with a probability of γ. There are 100 neurons and only 20 are fired.


2. The hidden neuron is distributed according to X ~ U(0, γ) random variable and fires
with a probability of γ.


PRB-276 CH.PRB- 9.10.
During pregnancy, the Placenta Chorion Test is commonly used for the diagnosis of 
hereditary diseases (Fig. 9.9). 


FIGURE 9.9: Foetal surface of the placenta 


Assume that a new test entitled the Placenta COVID19 Test has the exact same properties
as the Placenta Chorion Test. The test has a probability of 0.95 of being correct whether
or not a COVID19 pathogen is present. It is known that 1/100 of pregnancies result in
the COVID19 virus being passed to foetal cells. Calculate the probability of a test indicating
that a COVID19 virus is present.


PRB-277 CH.PRB- 9.11.

A person who was unknowingly infected with the COVID19 pathogen takes a walk in
a park crowded with people. Let y be the number of successful infections in 5 independent
social interactions or infection attempts (trials), where the probability of “success” (infecting
someone else) is θ in each trial. Suppose your prior distribution for θ is as follows: P(θ =
1/2) = 0.25, P(θ = 1/6) = 0.5, and P(θ = 1/4) = 0.25.


1. Derive the posterior distribution p(θ|y).


2. Derive the prior predictive distribution for y. 


PRB-278 CH.PRB- 9.12.

The 2014 West African Ebola (Fig. 9.10) epidemic has become the largest and fastest-
spreading outbreak of the disease in modern history with a death toll far exceeding all past
outbreaks combined. Ebola (named after the Ebola River in Zaire) first emerged in 1976 in
Sudan and Zaire and infected over 284 people with a mortality rate of 53%.


FIGURE 9.10: The Ebola virus. 


This rare outbreak underlined the challenge medical teams are facing in containing
epidemics. A junior data scientist at the Centre for Disease Control (CDC) models the possible
spread and containment of the Ebola virus using a numerical simulation. He knows that out
of a population of k humans (the number of trials), x are carriers of the virus (success in
statistical jargon). He believes the sample likelihood of the virus in the population follows a
Binomial distribution:


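L(\gamma \mid y) = \binom{n}{y} \gamma^{y} (1 - \gamma)^{n - y} \qquad (9.1)


where:


\binom{n}{y} = \frac{n!}{y!\,(n - y)!} \qquad (9.2)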


As the senior researcher in the team, you advise him that his parameter of interest is γ, the
proportion of infected humans in the entire population.
The expectation and variance of the binomial are:


E(y \mid \gamma, n) = n\gamma, \quad V(y \mid \gamma, n) = n\gamma(1 - \gamma). \qquad (9.3)
Answer the following: 


1. For the likelihood function of the form l_x(γ) = log L_x(γ), what is the log-likelihood
function?


2. Find the log-likelihood function ln(L(γ)).


3. Find the gradient vector g(γ).


4. Find the Hessian matrix H(γ).


5. Find the Fisher information I(γ).


6. In a population spanning 10,000 individuals, 300 were infected by Ebola. Find the
MLE for γ and the standard error associated with it.
Nothing exists until it is measured.


— Niels Bohr, 1985


10.1 Introduction


It is important at the outset to understand that we could not possibly include
everything we wanted to include in the first VOLUME of this series. While
the first volume is meant to introduce many of the core subjects in AI, the
second volume takes another step down that road and includes numerous,
more advanced subjects. This is a short glimpse into the plan for VOLUME-2 of this
series. This second volume focuses on more advanced topics in AI